
PRINCIPLES OF

STATISTICS
IPTHO
ABDUL SUKUR KAMSIR
OUTLINE
Introduction and Definitions
Sampling
External and Internal Validity
Sources of error
Normal distribution
Standard error
Binomial probabilities
Poisson distribution
Statistical tests
Confidence intervals
What is statistics?
A science in which inferences are made
about specific random phenomena on the
basis of a relatively limited sample of
material
The word “statistics” can also mean the
analytical tools used in this science i.e. the
calculated figures based on the data
collected.
Two Main Areas
Mathematical statistics – concerned with
the development of new methods of
statistical inference; requires a strong
mathematics background
Applied statistics – applying the methods
of mathematical statistics to specific
subject areas; BIOSTATISTICS is when it
is applied to biological or medical
problems
Definitions
Statistics - Collection of methods for
planning experiments, obtaining data, and
then organizing, summarizing, presenting,
analyzing, interpreting, and drawing
conclusions.
Statistic - Characteristic or measure
obtained from a sample, e.g. mean,
variance, chi-square statistic, t-test
statistic, etc.
Definitions
Inferential Statistics - Generalizing from
samples to populations using probabilities.
Performing hypothesis testing,
determining relationships between
variables, and making predictions.
Descriptive statistics refers to the process
of organizing and summarizing collected
information (data) to study the properties
of a variable
Definitions
Population - All subjects possessing a
common characteristic that is being
studied.
Sample - A subgroup or subset of the
population.
Parameter - Characteristic or measure
obtained from a population.
Statistic - Characteristic or measure
obtained from a sample.
Definitions
Variable - Characteristic or attribute that
can assume different values; it is the
fundamental element of statistical
analysis, i.e. something that is
measured, counted, or identified
Variables
Qualitative - Variables which assume non-
numerical values.
Quantitative - Variables which assume numerical
values.
Discrete - Variables which assume a finite or
countable number of possible values. Usually
obtained by counting.
Continuous - Variables which assume an infinite
number of possible values. Usually obtained by
measurement.
Variables (qualitative/categorical)
Nominal Level - Level of measurement
which classifies data into mutually
exclusive, all inclusive categories in which
no order or ranking can be imposed on the
data.
Ordinal Level - Level of measurement
which classifies data into categories that
can be ranked, but differences between
the ranks are not meaningful.
Variables (quantitative/numerical)
Interval Level - Level of measurement which
classifies data that can be ranked and
differences are meaningful. However, there is no
meaningful zero, so ratios are meaningless.
(temperature in Celsius, Fahrenheit, etc.)
Ratio Level - Level of measurement which
classifies data that can be ranked, differences
are meaningful, and there is a true zero. True
ratios exist between the different units of
measure. (temperature in Kelvin)
SAMPLING
Random - Sampling in which the data is collected using chance
methods or random numbers.
Systematic - Sampling in which data is obtained by selecting every
kth object.
Stratified - Sampling in which the population is divided into groups
(called strata) according to some characteristic. Each of these strata
is then sampled using one of the other sampling techniques.
Cluster - Sampling in which the population is divided into groups
(usually geographically). Some of these groups are randomly
selected, and then all of the elements in those groups are selected.

(all the above methods have a known selection probability)

Convenience - Sampling in which data that is readily available is used;
the probability of being selected into the sample is unknown.
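To illustrate, here is a minimal Python sketch of the three probability-based schemes above (the sampling frame of 1,000 subjects, the sample sizes, and the strata names are all invented for the example):

```python
import random

population = list(range(1, 1001))  # hypothetical sampling frame of 1,000 subjects

# Random sampling: each subject has an equal, known chance of selection
random_sample = random.sample(population, 50)

# Systematic sampling: select every k-th subject after a random start
k = len(population) // 50
start = random.randrange(k)
systematic_sample = population[start::k]

# Stratified sampling: divide into strata, then sample within each stratum
strata = {"urban": population[:600], "rural": population[600:]}
stratified_sample = [subject for stratum in strata.values()
                     for subject in random.sample(stratum, 25)]
```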
Why sample?

Cheaper than getting data from everyone!
An internally valid study, because a smaller study is:
- easier to manage
- easier to standardise
- easier to conduct
- involves fewer people
A good sampling method may ensure
external validity!
Beware when sampling
• Two samples have been taken at random
from the same population
• By chance, sample 1 contains a group of
relatively large fish while sample 2
contains relatively small fish
• You might mistakenly conclude that the
two populations are very different

* Even a random sample may not be a good
representative of the population from which it
has been taken
• Samples selected at random from very
different populations may not appear different
• Simply by chance, sample 1 and
sample 2 are similar

• Even if two populations are very different,
samples from each may be similar
• The misleading impression: the populations
are similar
* Natural variation among individuals within a
sample may obscure any effect of an
experimental treatment
• Two samples of equal-sized fish were taken from the same population
• One group was fed a vitamin-supplemented diet for 300 days; the other
was an untreated control group
• The supplemented diet caused a 10% increase in length, but the
difference is small compared with the variation in growth among
individuals, which may obscure any effect of treatment
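A small simulation makes the point; the fish lengths below are invented (control mean 20 cm, standard deviation 4 cm, and a true 10% increase in the treated group):

```python
import random
import statistics

random.seed(1)

control = [random.gauss(20, 4) for _ in range(30)]  # untreated fish lengths (cm)
treated = [random.gauss(22, 4) for _ in range(30)]  # true mean 10% higher

print(statistics.mean(control), statistics.stdev(control))
print(statistics.mean(treated), statistics.stdev(treated))
# The 2 cm true difference is small relative to the 4 cm individual spread,
# so the two samples overlap heavily and the effect is easily obscured.
```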
Because of the natural variability
among living species:

A ‘true’ difference may not be apparent
The effect of treatment may not be
apparent after a clinical trial
How to solve this unavoidable
problem in Life Sciences?
Researchers need to know how to sample
to ensure they have a good representative
sample of their population.

They also need a good understanding of
experimental design, because a good
design will take natural variation into
account.

They must know how to minimise
additional unwanted variation introduced
by the experimental procedure itself.

They need to take accurate and precise
measurements to minimise other sources
of error.
EXTERNAL VALIDITY
External validity is the extent to which the
results of a study are applicable to
OTHER populations
“Can my results be extrapolated to
others?”
INTERNAL VALIDITY
Internal validity is the extent to which the
results of an investigation accurately
reflects the true situation of the study
population
“The ability to measure what it sets out to
measure”
Avoids BIAS or SYSTEMATIC ERRORS
Sources of ‘errors’

Bias – a systematic error that can lead to a
distortion of the results; “deviation from truth”
Confounding
BIAS
Selection bias (non-random sample,
healthy worker effect, etc.)
Information bias (measurement
inaccuracy, misclassification, recall bias,
interviewer bias, etc.)
Performance bias may occur in
multicentre studies
Confounding
Mixing of the effect of an extraneous variable
with the effects of the exposure and disease of
interest
For a variable to be considered a confounder, it
must satisfy two conditions i.e. (a) has an
association with the outcome of interest and
(b) is also independently associated with the
exposure (NOT a result of being exposed)
Confounding
Occurs when groups being compared are
different with regards to important risk or
prognostic factors other than the factor
under investigation
Certain study designs are prone to
confounding i.e. case control
Mann JI et al (1968). Oral contraceptive use in older women and fatal
myocardial infarction. Br Med J 2: 193-199

153 women with myocardial infarction (MI) and 178 controls; past
exposure to oral contraceptives (OCP) was investigated.

Crude table:

|          | OCP user | Non-user |
| Cases    | 39       | 114      |
| Controls | 24       | 154      |

O.R. = 2.2

Second table is stratified according to age:

Age <40:
|          | User | Non-user |
| Cases    | 21   | 26       |
| Controls | 17   | 59       |
O.R. = 2.8

Age 40-44:
|          | User | Non-user |
| Cases    | 18   | 88       |
| Controls | 7    | 95       |
O.R. = 2.8

Note that the OR became higher within each stratum: the confounder
‘age’ weakened the apparent relationship between MI and OCP in the
crude analysis.
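These odds ratios can be checked directly from the 2×2 counts; a minimal Python sketch (the odds_ratio helper is written here for illustration):

```python
# Crude and age-stratified odds ratios from the Mann et al. (1968) counts
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed/unexposed cases;
    c, d = exposed/unexposed controls."""
    return (a * d) / (b * c)

print(round(odds_ratio(39, 114, 24, 154), 1))  # crude:     2.2
print(round(odds_ratio(21, 26, 17, 59), 1))    # age <40:   2.8
print(round(odds_ratio(18, 88, 7, 95), 1))     # age 40-44: 2.8
```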


COULD IT BE DUE TO CHANCE?
Type I and Type II errors
(will be explained in other lectures / slides)
Could errors have been
introduced?
Susceptibility (?differences in basic
characteristics)
Performance (e.g. differences in
proficiencies of treatment)
Detection (differences in measurement of
outcome)
Transfer (differential losses to follow-up)
The Normal Distribution
Theoretical distribution that has the shape of a
bell-shaped curve
Perfectly symmetrical about its centre
(mean=median=mode)
Standard deviation reflects the spread of
individual observations; 68% of the observations
lie within 1 standard deviation of the mean.
We can thus estimate the area under the curve
for any value of the variable once we know the
mean and standard deviation of the distribution
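For example, once the mean and standard deviation are known, the area under the curve can be computed rather than read from a table; a sketch using SciPy (the values 100 and 15 are arbitrary):

```python
from scipy.stats import norm

mu, sd = 100, 15  # assumed mean and standard deviation

# Area within 1 standard deviation of the mean (~68%)
print(norm.cdf(mu + sd, mu, sd) - norm.cdf(mu - sd, mu, sd))  # ~0.683

# Area under the curve below any value, e.g. P(X < 120)
print(norm.cdf(120, mu, sd))
```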
Normal distribution curves
Normal distribution
Many other distributions, e.g. the binomial
and the Poisson, approximate the
normal distribution under certain
conditions
The advantage of this normal
approximation is that standard probability
tables for the normal distribution can be
used for binomial problems or Poisson
distributions.
Standard Error
The spread of observations in one experiment
yields a single mean and standard deviation
Repeated sampling from the same population
will result in a normal distribution of means with
a ‘grand’ mean (mean of means) and a spread
called STANDARD ERROR
Standard error = standard deviation ÷√n where n
is the sample size
Sample size and variability of measurements
determine magnitude of standard error
Used to construct confidence intervals
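A quick simulation shows that the spread of repeated sample means matches the analytic standard deviation ÷ √n (the population parameters below are invented):

```python
import random
import statistics

random.seed(0)
pop_mean, pop_sd, n = 50, 10, 25

print(pop_sd / n ** 0.5)  # analytic standard error = 2.0

# Repeated sampling: standard deviation of the sample means
means = [statistics.mean(random.gauss(pop_mean, pop_sd) for _ in range(n))
         for _ in range(2000)]
print(round(statistics.stdev(means), 2))  # close to 2.0
```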
Binomial probabilities
The distribution curve has a mean (M)=np and
standard deviation (S)=√(npq) where n is the
number of trials; p is the probability of outcome
A and q is the probability of outcome B.
Refers to situations where there are 2
alternatives (success/failure; black/white;
heads/tails; alive/dead etc) i.e. p+q=1
Used to determine whether results observed in
trials/experiments would have occurred
randomly
Normal approximation to the
binomial probability distribution
Large n (number of trials)
p not close to 0 (probability of occurrence
is not rare)
Product np > 5
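A sketch comparing the exact binomial probability with its normal approximation when these conditions hold (n = 100 and p = 0.3 are arbitrary choices with np = 30 > 5; a continuity correction is applied):

```python
from math import sqrt
from scipy.stats import binom, norm

n, p = 100, 0.3
mu, sd = n * p, sqrt(n * p * (1 - p))  # M = np, S = sqrt(npq)

print(binom.cdf(35, n, p))     # exact P(X <= 35)
print(norm.cdf(35.5, mu, sd))  # normal approximation, continuity-corrected
```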
Example of a problem to be solved
by using binomial theory / method:-
What is the probability of r
successes (As) in a sample of n trials?
Formula to calculate the probability
P(r) = nCr · p^r · (1 − p)^(n − r)
where r is the number of As (successes),
n is the number of trials and nCr is known
as the binomial coefficient.
Description of binomial problem in a later
lecture.
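The formula translates directly into code; a minimal sketch using Python's built-in binomial coefficient, applied to a coin-toss example:

```python
from math import comb

def binomial_prob(r, n, p):
    """P(r) = nCr * p**r * (1 - p)**(n - r)"""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Probability of exactly 3 heads in 10 fair coin tosses
print(binomial_prob(3, 10, 0.5))  # 0.1171875
```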
Poisson Distribution
Useful for calculating probabilities of rare events
No a priori estimate can be made of the
probability that the event will occur
Pr(n) = e^(−m) · m^n / n!
where Pr(n) is the probability of occurrence of n such events, m is the
mean number of events, e is a mathematical constant, and n! is the
factorial of n
The mean and variance of a Poisson
distribution both equal m; the standard
deviation is √m
Poisson Probability Distribution
Must meet 4 conditions i.e.
Discontinuous (discrete) data
Chance of a result is SMALL
Chance of a result is independent of
previous results
A large number of tests can be performed
Approaches the normal distribution, with
mean m and standard deviation √m, for
mean values > 30
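A minimal sketch of the Poisson formula, evaluated for a hypothetical rare event with a mean of m = 2 occurrences per interval:

```python
from math import exp, factorial

def poisson_prob(n, m):
    """Pr(n) = e**(-m) * m**n / n!"""
    return exp(-m) * m**n / factorial(n)

for n in range(5):  # probabilities of 0..4 events when the mean is 2
    print(n, round(poisson_prob(n, 2), 4))
```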
Statistical tests
Used for testing of statistical significance
To decide whether to accept or reject your
null hypothesis
Type of test you use depends on type of
data as well as whether the data
approximates the normal curve
General Guide to select an appropriate statistical
test in univariate analysis
| Number of groups | Independent variable | Dependent variable | Parametric test | Non-parametric test |
|---|---|---|---|---|
| Two (independent) | Categorical (e.g. smokers and non-smokers) | Categorical (e.g. CHD and no CHD) | - | Chi-square test; Fisher's Exact test |
| Two (independent) | Categorical (e.g. smokers and non-smokers) | Categorical (e.g. CHD and no CHD) | - | Mantel-Haenszel test (if a third variable, e.g. age group, is controlled for) |
| Two (independent) | Categorical (e.g. smokers and non-smokers) | Numerical (e.g. PEFR level) | Independent t-test | Mann-Whitney test (Wilcoxon Rank Sum test) |
| Two (dependent) | Categorical (e.g. pre- and post-intervention) | Categorical (e.g. behavioural changes) | - | McNemar test |
| Two (dependent) | Categorical (e.g. pre- and post-treatment) | Numerical (e.g. blood pressure) | Paired t-test | Wilcoxon Signed Rank test |
| >Two (independent) | Categorical (e.g. race) | Categorical (e.g. diabetic and non-diabetic) | - | Chi-square test |
| >Two (independent) | Categorical (e.g. race) | Numerical (e.g. blood sugar level) | One-way ANOVA | Kruskal-Wallis test |
| Two | Numerical (e.g. height) | Numerical (e.g. weight) | Pearson correlation & Simple Linear Regression | Spearman correlation |
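To illustrate two rows of the guide with SciPy (the 2×2 counts and PEFR readings below are invented for the example):

```python
from scipy import stats

# Two independent groups, categorical outcome -> chi-square test
table = [[30, 70],   # smokers:     CHD / no CHD
         [15, 85]]   # non-smokers: CHD / no CHD
chi2, p, dof, expected = stats.chi2_contingency(table)
print(p)

# Two independent groups, numerical outcome -> independent t-test
smokers = [410, 390, 420, 400, 385]      # PEFR (L/min)
non_smokers = [450, 460, 440, 455, 448]
t, p = stats.ttest_ind(smokers, non_smokers)
print(p)
```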
Confidence Intervals (CI)
Estimating where the “true” population
parameter is believed to be found within a given
level of confidence (95%, 99% or more)
Parameters of interest are usually means,
proportions, difference between means or
proportions, regression coefficients, correlation
coefficients, relative risks
CIs extremely useful in assessing clinical
significance of a given result
95% CI = sample estimate ± 1.96 × SE
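A minimal sketch of this formula for a sample mean (the data are invented; the slide's large-sample multiplier 1.96 is used, though a t multiplier is more exact for small samples):

```python
import statistics

data = [5.1, 4.9, 5.4, 5.0, 5.3, 4.8, 5.2, 5.1, 4.7, 5.5]
mean = statistics.mean(data)
se = statistics.stdev(data) / len(data) ** 0.5  # standard error of the mean

lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```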
THANK YOU
