
Good morning

Biostatistics
Ashfaq yaqoob
18.01.2010

Introduction
Any science needs precision for its development.
For precision, facts, observations or
measurements have to be expressed in figures.
It has been said: "When you can measure what
you are speaking about and express it in
numbers, you know something about it; but
when you cannot express it in numbers, your
knowledge is of a meagre and unsatisfactory
kind." - Lord Kelvin

Similarly in medicine, be it diagnosis, treatment
or research, everything depends on measurement.
E.g. you have to count the number of
missing teeth OR measure the vertical
dimension and express it as a number so that it
makes sense.
Statistic or datum means a measured or
counted fact or piece of information stated as
a figure, such as the height of one person, the
birth weight of a baby, etc.
Statistics or data is plural of the same.
Statistics is science of figures.

It is a field of study concerned with
techniques/methods of collection of data,
classification, summarizing, interpretation,
drawing inferences, testing hypothesis and
making recommendations.

Biostatistics- is term used when tools of
statistics are applied to the data derived from
biological sciences.
Data - discrete observations of
attributes/events that carry little meaning
when considered alone.
Information is data which is reduced and
adjusted according to variations such as age
and sex, so that comparisons over time and place
are possible.
Intelligence is transformation of
information through integration and
processing with experience and perceptions
based on social and political values.
Any measurable characteristic of a
population is called a Parameter.


Statistics used to summarize, or describe,
the characteristics of a sample are called
Descriptive statistics.

Statistical procedures that are used to make
inferences (ie, draw conclusions) about the
population that the sample represents are
called Inferential statistics.
Descriptive statistics
In the real world, we cannot study every
member of an entire population.
Instead, we must select a sample in the hope
that it will serve as a representative surrogate.

A sample can be used to estimate quantities in the
population as a whole.


Sampling variations are minimized by
-adequate sample size
-proper sampling techniques

Non-random sampling - easier and more
convenient to perform.

Random sampling (also called probability
sampling) - everyone in the sampling frame
has an equal probability of being chosen.


Non-random sampling (also called non
probability sampling) does not have these aims,
but is usually easier and more convenient to
perform.

Convenience or opportunistic sampling is the
crudest type of non-random sampling.

This involves selecting the most convenient
group available (e.g. using the first 20
colleagues we see at work).
Though simple to perform, it is unlikely to
result in a sample that is either representative of
the population or replicable.
Random selection of samples is important

In random sampling, everyone in the sampling
frame has an equal probability of being chosen.

sample is truly representative of the population

It can help minimize bias (bias can be defined as
an effect that produces results which are
systematically different from the true values )


Simple random sample using random numbers.
a. lottery method
b. Table of random numbers.
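As a sketch, a simple random sample (the idea behind the lottery method and the table of random numbers) can be drawn with Python's random module; the 500-record sampling frame and the sample size of 20 here are hypothetical:

```python
import random

# Hypothetical sampling frame: 500 patient record IDs (illustrative only)
sampling_frame = list(range(1, 501))

random.seed(42)  # fixed seed so the draw is reproducible
# Each ID is equally likely to be chosen, and no ID repeats
sample = random.sample(sampling_frame, 20)

print(sorted(sample))
```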


Multi-stage sampling - carried out in stages, e.g. a
school health survey in which schools are sampled
first and then children within the selected schools.

Cluster sampling -all of the subjects in the final-stage
sample are investigated.

Stratified sampling - to randomly select subjects
from different strata or groups.



Systematic sampling is performed by selecting one
unit at random and then selecting additional
units at evenly spaced intervals till a sample of the
required size is formed.
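A minimal sketch of systematic sampling, assuming a hypothetical frame of 100 subjects and a required sample of 10:

```python
import random

# Hypothetical frame of 100 subjects; desired sample size of 10
frame = list(range(1, 101))
n = 10
k = len(frame) // n  # sampling interval k = N / n = 10

random.seed(7)
start = random.randrange(k)  # one unit selected at random from the first interval
sample = frame[start::k]     # then every k-th unit at evenly spaced intervals

print(sample)
```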
Pathfinder surveys - a specified proportion of the
population (e.g. 1%) is surveyed.
Sources of data
1. Experiments
2. Surveys
3. Records

Primary
Secondary



Categories
1. Quantitative/continuous - measured with a number
2. Qualitative/discrete - cannot be meaningfully
summarized by a number.

Qualitative or discrete data
In such data there is no notion of magnitude or
size of an attribute as the same cannot be
measured.
The number of persons having the same
attribute is variable and is what is measured,
e.g. out of 100 people, 75 have Class I
occlusion, 15 have Class II occlusion and 10
have Class III occlusion.
Class I, II and III are attributes which cannot be
measured in figures; only the number of people
having them can be determined.

Quantitative or continuous data
In this the attribute has a magnitude; both
the attribute and the number of persons
having the attribute vary.
E.g. freeway space: it varies for every patient. It
is a quantity with a different value for each
individual and is measurable. It is continuous
as it can take any value between 2 and 4, e.g.
2.10, 2.55 or 3.07.
Data presentation
Statistical data once collected should be
systematically arranged and presented
To arouse interest of readers
For data reduction
To bring out important points clearly and
strikingly
For easy grasp and meaningful conclusions
To facilitate further analysis
To facilitate communication

Two main types of data presentation are
Tabulation
Graphic representation with charts and
diagrams

Tabulation
It is the most common method
Data presentation is in the form of columns
and rows
General principles for designing tables.

1. Tables should be numbered.
2. A title- brief and self explanatory should be given for
each table.
3. Headings of rows and columns should be clear and
concise.
4. Data must be presented according to size or
importance (chronologically/alphabetically).


It can be of the following types
Simple tables
Frequency distribution tables
Simple table
No. of patients in MCODS Mangalore
Jan 2006     2000
Feb 2006     1800
March 2006   2300
Frequency distribution table
Data is first split into convenient groups and
number of items in each group is shown in
adjacent columns.
Frequency distribution table
Number of Cavities Number of Patients
0 to 3 78
3 to 6 67
6 to 9 32
9 and above 16
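Building such a frequency distribution table from raw observations can be sketched as follows; the cavity counts are hypothetical, and the class intervals are taken as 0-3, 3-6, 6-9 and 9 and above (lower bound inclusive):

```python
# Hypothetical raw data: number of cavities recorded per patient
cavities = [0, 2, 5, 1, 7, 3, 9, 12, 4, 6, 2, 0, 8, 10, 1]

# Class intervals matching the table above
labels = ["0 to 3", "3 to 6", "6 to 9", "9 and above"]
freq = {label: 0 for label in labels}
for c in cavities:
    if c < 3:
        freq["0 to 3"] += 1
    elif c < 6:
        freq["3 to 6"] += 1
    elif c < 9:
        freq["6 to 9"] += 1
    else:
        freq["9 and above"] += 1

for label, count in freq.items():
    print(f"{label:12s} {count}")
```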
Charts and diagrams

Useful method of presenting statistical data
Powerful impact on imagination of the people

Bar chart

The length of the bars, drawn vertically or
horizontally, is proportional to the frequency of the
variable.
A suitable scale is chosen
Bars are usually equally spaced
They are of three types
-simple bar chart
-multiple bar chart
two or more variables are grouped together
-component bar chart
bars are divided into parts,
each part representing a certain item and
proportional to the magnitude of that item
Bar diagrams
Simple
Sub-divided
Multiple
Histogram
- Pictorial diagram of a frequency distribution.

Frequency polygon
- Obtained by joining the midpoints of histogram
blocks at the height of frequency by straight
lines, usually forming a polygon.

[Histogram/frequency polygon figure: number of carious
lesions on the x-axis in class intervals of 3 (0 to 27);
number of patients on the y-axis (scale 0 to 80).]

Pie charts
In this the frequencies of the groups are shown as
segments of a circle
The degree of the angle denotes the frequency
Angle is calculated by

Angle = (class frequency × 360°) / total observations


[Pie chart of patients by department: PROSTHO 200 (31%),
CONSO 150 (24%), PERIO 180 (29%), ORTHO 70 (11%),
PEDO 30 (5%).]
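Applying the angle formula to the pie-chart figures above (the department counts as given on the slide):

```python
# Department-wise patient counts from the pie-chart example
counts = {"PROSTHO": 200, "CONSO": 150, "PERIO": 180, "ORTHO": 70, "PEDO": 30}
total = sum(counts.values())

# Angle of each segment = class frequency x 360 / total observations
angles = {dept: f * 360 / total for dept, f in counts.items()}
for dept, angle in angles.items():
    print(f"{dept:8s} {angle:6.1f} degrees")
```

The angles necessarily sum to 360°, which is a quick check on the arithmetic.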
Scatter diagrams: show relation between two
variables.
If the dots are clustered around a straight line, this
is evidence of a relationship of a linear nature.
If there is no such cluster, it is probable that there is
no relation between the variables.

[Scatter diagram: sugar exposure on the x-axis (0 to 15)
against carious lesions on the y-axis (0 to 14).]
Pictogram
Popular method of presenting data to the
common man

Spot map or map diagram
These maps are prepared to show geographic
distribution of frequencies of characteristics
Measures of statistical averages or
central tendency
Implies a value in a distribution around which
other values are distributed.
Gives a picture of the central value.
1. Arithmetic mean
2. Median
3. Mode


Mean
refers to arithmetic mean
it is the summation of all the observations
divided by the total number of observations (n)
denoted by X̄ for a sample and μ for a population
X̄ = (x1 + x2 + x3 + ... + xn) / n
Advantages - it is easy to calculate
Disadvantages - influenced by extreme values
Median
When all the observations are arranged either in
ascending or descending order, the middle
observation is known as the median.
In the case of an even number of observations, the
average of the two middle values is taken.
The median is a better indicator of central value as it
is not affected by extreme values.

Mode
Most frequently occurring observation in a data
is called mode
Not often used in medical statistics.
Example
Number of decayed teeth in 10 children
2,2,4,1,3,0,10,2,3,8
Mean = 35 / 10 = 3.5

Median = (0,1,2,2,2,3,3,4,8,10) = (2+3)/2
= 2.5

Mode = 2 ( 3 Times)
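The worked example above can be checked with Python's statistics module:

```python
import statistics

# The slide's worked example: decayed teeth in 10 children
teeth = [2, 2, 4, 1, 3, 0, 10, 2, 3, 8]

mean = statistics.mean(teeth)      # sum of 35 divided by 10
median = statistics.median(teeth)  # average of the two middle values, (2 + 3) / 2
mode = statistics.mode(teeth)      # most frequent observation (2 occurs 3 times)

print(mean, median, mode)  # 3.5 2.5 2
```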

Variations
Collected data shows considerable variation:

variation from person to person, and also
variation in the same person at different times.

Thus measures of variation / dispersion are
used.
Range
Mean/average deviation
Standard deviation (sigma )
Range - difference between the highest and lowest
values

Mean deviation - average of the absolute deviations
from the arithmetic mean.
M.D. = Σ|X − X̄| / n
X = observation
X̄ = mean
n = number of observations

Standard deviation - root mean square
deviation.
Denoted by σ (sigma) or S.D.
σ = √( Σ(X − X̄)² / n )

Greater the standard deviation, greater will be
the magnitude of dispersion from mean
Small standard deviation means a high degree of
uniformity of the observations
Usually measurements beyond the range of mean ± 2
SD are considered rare or unusual in any
distribution
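A small sketch of the mean deviation and standard deviation calculations, using hypothetical observations and the population form of the SD (divisor n, as in the formula above):

```python
import math

# Hypothetical observations (e.g. probing depths in mm)
x = [2.0, 3.0, 4.0, 3.0, 5.0, 3.0, 2.0, 4.0]
n = len(x)
mean = sum(x) / n

# Mean deviation: average of the absolute deviations from the mean
mean_dev = sum(abs(xi - mean) for xi in x) / n

# Standard deviation: root-mean-square deviation (population form, divisor n)
variance = sum((xi - mean) ** 2 for xi in x) / n
sd = math.sqrt(variance)

print(round(mean, 4), round(mean_dev, 4), round(sd, 4))
```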




Variance of the data is the square of the standard deviation (σ²).
Another way to describe dispersion is to
present interquartile ranges, such as the
values for the 25th and 75th percentile level,
which are not as likely to be influenced by the
values at the extreme upper and lower end of
the spread of data points.

For continuous data, the most commonly used
measure of central tendency is the mean.


For ordinal data, the median or mode is used to
represent the center of the data.


The median is also used as a measure of central
tendency for continuous data that are skewed to
minimize the effect of extremely large or small
values on the estimate of the center of the data.

Nominal data are summarized by reporting the
proportion or percentage of the data that are
classified in each level.


Sample Size and Power

Designing studies with inadequate sample sizes
may lead to errors and false conclusions
(false-negative findings).

False-negative findings can occur either by
chance or because the study is underpowered.

Careful sample size calculation can guide
researchers as to what can and cannot be
accomplished in a study with a finite amount of
resources.
Although the sample size calculations are
performed using mathematical methods, the
preparation for the calculation requires both
statistical reasoning and clinical experience.

Calculation of sample size requires four things:
1. Deciding on the design of study
2. Assessing the availability of resources
3. Specifying distribution assumptions
4. Defining a clinically relevant effect



Inferential statistics
Inferential statistics are those statistical
procedures that compare groups to see if the
groups are significantly different from each
other.
two kinds
parametric statistics
nonparametric statistics.

Parametric statistics refers to a group of
statistical tests that uses means and a measure of
variation (standard deviation, variance) to help
determine if groups are different from each
other.

Certain conditions regarding the data must be met
before the simplest parametric tests, based on means
and standard deviations, may be validly used.


1. The data must be continuous (measured on a
continuous scale, eg, millimeters, pounds, degrees)

2. A scatter plot of the data must look like a normal
distribution (bell shaped curve) and

3. The dispersion or spread of data for each variable
must be the same in each group being compared (the
size of the variance or standard deviation of the
variable is the same in each of the groups being
compared).

Distributions
Begin the initial analysis by plotting the data on a
graph to see how they are distributed.
The points may be seen to follow some recognized
pattern or distribution.
Many patterns of distributions occur in nature.
Frequently, these patterns can be described by
mathematical functions, which then enable us to
determine the likelihood that a data point will
fall under a specific area of the distribution
curve.
The Normal distribution or Gaussian
distribution.
Bell - shaped curve
The data cluster around a central point and
spread symmetrically around this center point.
the central point is the mean of the sample.
The width of the bell-shaped curve depends on
how much variability there is in the data.
The way to estimate the amount of variability is to
calculate the SD, the square root of the average squared
deviation of each data point from the mean value of all the
data points.
The larger the SD is, the greater the variability in the data.
The greater the variability is, the wider the shape of the
curve.

Importance of distribution
Many statistical tests are based on parametric assumptions
(ie, the data are assumed to follow a distribution that can be
summarized by parameters) requiring distribution of the
data which is normal (bell-shaped).

Many parametric statistical tests are insensitive to mild
departures of the data from normality, but severe
departures from the normal distribution mandate the use of
distribution-free tests- nonparametric statistics.

Parametric statistics tend to be more powerful
than nonparametric statistics.

This means that they are more likely than
nonparametric statistics to detect a significant
difference between samples when the
difference is real, but use of a parametric test
when its assumptions are violated is incorrect.

Common parametric tests include the
Student t test and
Analysis of variance (ANOVA)

Ordinal data are analyzed by nonparametric
procedures.
Nonparametric statistics use the ranks/ medians of the
data rather than means and standard deviations to
make group comparisons.
Common nonparametric tests based on ranks include
the Mann-Whitney U test,
the Wilcoxon signed rank test, and
the Kruskal-Wallis test

Nonparametric statistical tests are also used for
continuous data that are not normally distributed
(bell-shaped curve).


The most common test to analyze nominal data is the
χ² (chi-square) test
Data that are nominal (eg, sex, tooth type) cannot be
summarized by means or ordered into ranks.
Ratios / proportions can be determined.


Test Statistics
Statistical procedures comparing samples provide a
test statistic or critical ratio that is associated with a
probability level (P value).
The probability level is the likelihood or chance that
two groups, each representative of the same
population, would be chosen and would show a
difference at least as big as the one detected.
A P value < .05 means there is a 5% or lower
chance (1 in 20) that the two groups could be
samples from the same population.
By convention, when P < .05, it is concluded that
the groups are not from the same population and
therefore are statistically, significantly different.

Parametric Tests
The Student t test is used when only two groups are
being compared.

The Student t test uses sample means and standard
deviations to calculate the probability or likelihood that
the groups are different.

It helps us to determine if the means differ because the
two groups represent two different populations or if the
means differ because the groups have different subjects
but each group represents the same population.
It exists in two forms depending on whether the
two groups under comparison are
paired (matched) or
independent of each other.

A common paired design occurs when a single group of
subjects is measured before and after a procedure to
examine the effect of some intervention (eg, treatment).
A matched group study design is one in which the
outcome of each subject in the treatment group is
compared directly to the outcome in another subject who
is as similar as possible to its mate, with the exception of
the treatment under investigation.

An example of a paired study is a comparison of
masticatory efficiency of complete denture
wearer with bilateral balanced occlusion after
selective grinding.
Two-sample, independent t test
- used to compare independent or unmatched
groups.
An example is comparing the masticatory
efficiency of bilateral balanced occlusion
and lingualised occlusion in complete denture
wearers.

In paired study designs, the number of subjects
in both groups is the same, whereas in the two-
sample, independent design, the size of the two
samples may be different.
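A hand computation of the two-sample, independent t statistic (pooled-variance form) can be sketched as follows; the masticatory-efficiency scores for the two groups are hypothetical:

```python
import math

# Hypothetical masticatory-efficiency scores for two independent groups
g1 = [70.0, 72.0, 68.0, 75.0, 71.0]        # bilateral balanced occlusion
g2 = [65.0, 69.0, 66.0, 70.0, 64.0, 67.0]  # lingualised occlusion

n1, n2 = len(g1), len(g2)
m1, m2 = sum(g1) / n1, sum(g2) / n2

# Pooled variance: combined sums of squares over n1 + n2 - 2 degrees of freedom
ss1 = sum((x - m1) ** 2 for x in g1)
ss2 = sum((x - m2) ** 2 for x in g2)
sp2 = (ss1 + ss2) / (n1 + n2 - 2)

# Two-sample, independent t statistic
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t, 3))
```

The statistic would then be compared against the t distribution with n1 + n2 − 2 degrees of freedom to obtain a P value.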

If more than two groups are being compared, the
ANOVA is used.

Unlike the t test, which uses the mean and standard
deviation of groups for its computations, ANOVA
uses the mean and variance of groups for
computations.

Test statistic is F statistic.

ANOVA makes a series of pair-wise comparisons for
all the groups in the comparison.
A significant P value indicates that a difference exists
somewhere between any two comparisons, but ANOVA
does not identify which groups are different.
To determine which pairs differ, post hoc (a posteriori)
tests are used to examine the groups in detail and
reveal which groups significantly differ from each other.
Common post hoc tests are
the Tukey-Kramer honestly significant difference,
Scheffé, Dunnett, Duncan, and Newman-Keuls tests.
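The F statistic underlying one-way ANOVA can be computed by hand; the three groups of measurements below are hypothetical:

```python
# One-way ANOVA F statistic for three hypothetical groups of measurements
groups = [
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 8.0, 6.0, 7.0],
    [5.0, 6.0, 5.0, 4.0],
]

k = len(groups)                       # number of groups
N = sum(len(g) for g in groups)       # total number of observations
grand_mean = sum(sum(g) for g in groups) / N

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

# F = (between-group mean square) / (within-group mean square)
F = (ss_between / (k - 1)) / (ss_within / (N - k))
print(round(F, 3))
```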


Nonparametric Tests
A common nonparametric test for
comparison of two unpaired samples is the
Mann-Whitney U test also known as the
Wilcoxon rank sum test.
Compares the medians of the groups.
Test statistic is U statistic.
Example -grade point averages
The comparable nonparametric test to the
paired t test is the Wilcoxon signed rank test.
The nonparametric test comparable to the ANOVA is the
Kruskal-Wallis procedure.

Examines intergroup differences based on ranks.


χ² test
- used to analyze nominal data.
It is used to compare the proportion of the data
that fall into each level of the nominal variable.
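A sketch of the χ² statistic for a hypothetical 2×2 table of nominal data, with expected counts obtained from the row and column totals:

```python
# Hypothetical 2x2 table: caries present / absent by sex
#           Caries   No caries
# Male        30        20
# Female      25        25
observed = [[30, 20], [25, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Chi-square = sum over cells of (O - E)^2 / E,
# where E = row total x column total / grand total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand
        chi2 += (o - e) ** 2 / e

print(round(chi2, 3))
```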




Correlation.

To test whether or not two variables bear a linear
relationship to each other (ie, whether or not they vary
together, either positively or negatively), the technique
of Pearson product-moment linear correlation is
commonly used.

The correlation coefficient (r), a dimensionless index,
indicates the extent to which the two characteristics
vary together.

r can range from +1, denoting a perfect positive
relationship, to −1, characteristic of a perfect negative
relationship; r = 0 signifies complete independence.

In practice r usually takes intermediate values such as
0.6, −0.3 or 0.1
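The Pearson product-moment coefficient can be computed directly from its definition; the paired observations below are hypothetical:

```python
import math

# Hypothetical paired data, e.g. sugar exposure (x) vs carious lesions (y)
x = [1, 2, 4, 5, 7, 9, 11, 13]
y = [2, 3, 5, 5, 8, 9, 12, 12]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Pearson product-moment correlation coefficient:
# covariance term divided by the product of the standard-deviation terms
num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
den = math.sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
r = num / den

print(round(r, 3))
```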


Regression.
If a linear relationship is significant statistically
and is strong enough to be of practical use, the
next step is to model it mathematically in the
form of a prediction equation so that it can be
used clinically.
Y = A + BX

Regression and correlation are closely related: one deals
with the strength of a linear relationship and the other
with its form.
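Fitting the prediction equation Y = A + BX by least squares can be sketched as follows, using the same kind of hypothetical paired data:

```python
# Hypothetical paired data, e.g. sugar exposure (x) vs carious lesions (y)
x = [1, 2, 4, 5, 7, 9, 11, 13]
y = [2, 3, 5, 5, 8, 9, 12, 12]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least-squares estimates for the prediction equation Y = A + B*X
B = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
    sum((xi - mx) ** 2 for xi in x)
A = my - B * mx

def predict(xv):
    # The fitted line can then be used clinically for prediction
    return A + B * xv

print(round(A, 3), round(B, 3))
```

Note that the fitted line always passes through the point of means (X̄, Ȳ), which is a convenient sanity check.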

Multivariate Analysis
A statistical analysis that involves more than
one dependent variable.
The analysis of simultaneous relationships
among several variables.
Examining simultaneously the effects of age, sex,
and social class on hypertension would be an
example of multivariate analysis
Considers the interrelationships of several traits
at a time .
Multivariate analysis comprises a set of
techniques dedicated to the analysis of data sets
with more than one variable.

One data set

Interval or ratio level of measurement: principal
component analysis (PCA)


Nominal or ordinal level of measurement:
correspondence analysis (CA), multiple
correspondence analysis (MCA)

Similarity or distance: multidimensional scaling (MDS)
- Multidimensional scaling (MDS) is a set of related
statistical techniques often used in data visualization for exploring
similarities or dissimilarities in data.


Two data sets
Case one: one independent variable set and one
dependent variable set-
Multiple linear regression analysis (MLR)
Regression with too many predictors and/or several
dependent variables
Partial least square (PLS) regression (PLSR)
Principal component regression (PCR)
Ridge regression (RR)
Reduced rank regression (RRR) or redundancy analysis
Multivariate analysis of variance (MANOVA)
Predicting a nominal variable: discriminant analysis
(DA)
Fitting a model: confirmatory factor analysis (CFA)





Two (or more) dependent variable sets:
Canonical correlation analysis (CC)
Multiple factor analysis (MFA)
Multiple correspondence analysis (MCA)
Procrustean analysis (PA)
Regression analysis
In statistics, regression analysis is used to
model relationships between random variables,
determine the magnitude of the relationships
between variables, and can be used to make
predictions based on the models.
Predictor variables may be defined quantitatively or
qualitatively (or categorical).

If the predictors are all quantitative,- multiple
regression.

If the predictors are all qualitative, one performs analysis
of variance.

If some predictors are quantitative and some qualitative,
one performs an analysis of covariance
If two or more independent variables are
correlated, we say that the variables are
multicollinear.
Multicollinearity results in parameter estimates
that are unbiased and consistent, but which may
have relatively large variances
Many patterns of distributions occur in
nature. Frequently, these patterns can be
described by mathematical functions.
The most common statistical tests can be
applied to data that is normally distributed.
What if the data obtained are not normally
distributed?
A log transformation of the data toward a normal
distribution is undertaken, after which the usual
normal-theory statistical tests can be applied to the
log-transformed data.
Logistic regression
In statistics, logistic regression is a model used for
prediction of the probability of occurrence of an event.


It makes use of several predictor variables that may be
either numerical or categorical. For example, the
probability that a person has a heart attack within a
specified time period might be predicted from
knowledge of the person's age, sex and body mass index.
The "input" is z and the "output"
is f(z). The logistic function is
useful because it can take as an
input, any value from negative
infinity to positive infinity,
whereas the output is confined
to values between 0 and 1.
The variable z represents the
exposure to some set of risk
factors, while f(z) represents the
probability of a particular
outcome, given that set of risk
factors. The variable z is a
measure of the total
contribution of all the risk
factors used in the model and is
known as the logit
z = β0 + β1x1 + β2x2 + β3x3 + ...
β0 is the intercept value
- it is the value of z when all other risk factors are absent.

β1, β2 and β3 are regression coefficients

x1, x2 and x3 are risk factors for heart disease

The application of a logistic regression may be illustrated
using a fictitious example of death from heart disease.
This simplified model uses only three risk factors (age,
sex and cholesterol) to predict the 10-year risk of death
from heart disease.


β0 = −5.0 (the intercept)
β1 = +2.0
β2 = −1.0
β3 = +1.2
x1 = age in decades
x2 = sex, where 0 is male and 1 is female
x3 = cholesterol level, in mmol/dl

Risk of death = 1 / (1 + e^−z), where
z = −5.0 + 2.0x1 − 1.0x2 + 1.2x3
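The fictitious model above can be evaluated directly; the coefficients are those given on the slide, while the subject's values are a hypothetical example:

```python
import math

# Coefficients from the slide's fictitious heart-disease example
b0, b1, b2, b3 = -5.0, 2.0, -1.0, 1.2

def risk_of_death(age_decades, sex, cholesterol):
    """10-year risk of death; sex coded 0 = male, 1 = female."""
    z = b0 + b1 * age_decades + b2 * sex + b3 * cholesterol
    return 1 / (1 + math.exp(-z))  # logistic function maps any z into (0, 1)

# Hypothetical subject: a 50-year-old man (5 decades) with cholesterol level 7
p = risk_of_death(5, 0, 7)
print(round(p, 6))
```

Whatever the value of z, the output is confined between 0 and 1, which is what makes the logistic function suitable for modelling a probability.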
Discriminant Analysis
Discriminant function (modified Maddrey's
discriminant function)
originally described by Maddrey and Boitnott to predict
prognosis in alcoholic hepatitis.

Canonical variate analysis attempts to establish whether a
set of variables can be used to distinguish between two
or more groups.

Suppose we have two samples representing different
populations,
We measured one character for both and found that
their means for this character are not identical, but their
distributions overlap considerably, so that on the
basis of this character alone one could not, with any
degree of accuracy, identify an unknown specimen as
belonging to one or the other of the two populations.
A second character may also differentiate them
somewhat, but not absolutely.
Two variables, say X1 and X2, can be used together to
distinguish them.



Discriminant function analysis computes a new variable,
say Z, which is a linear function of both variables X1 and
X2.
This function is constructed in such a way that as many
as possible of the members of one population have high
values of Z and as many as possible of the members of
the other have low values, so that Z serves as a much
better discriminator between the two populations than
does either variable X1 or X2 taken singly.
Example: blood pressure, cholesterol levels
and blood sugar differ between those who
are obese and those of normal body build.

Discriminant function analysis can be utilised
for assessing the combined effect of factors that
are different between the two groups of subjects.

Meta-analysis
In statistics a meta-analysis combines the results of
several studies that address a set of related research
hypotheses.

The first meta-analysis was performed by Karl Pearson
in 1904, in an attempt to overcome the problem of
reduced statistical power in studies with small sample
sizes; analyzing the results from a group of studies can
allow more accurate data analysis.
CONCLUSION
Understanding the complexities of statistical
modeling not only enable the use of test
characteristics in the actual design of diagnostic
tests, but familiarity with fundamental concepts
will also facilitate insight and critical evaluation
of research that relies on such methodology.


Thank you
