
11

Health Information And Biostatistics


Muhammad Irfanullah Siddiqui, Lubna Ansari Baig, Nazir Ahmed Mohamed Iliyas,

Muhammad Irfanullah Siddiqui

Objectives
At the end of the chapter students will be able to:
1. Define biostatistics.
2. Describe concepts and uses with respect to biostatistics.
3. Classify data and its types.
4. Describe the methods and types of census.
5. Describe the relation between ratios, proportions, percentages and rates.
6. Calculate and interpret crude, specific and standardized rates.
7. Describe the procedure of collection and registration of vital events in Pakistan.
8. Describe the sources of health-related statistics.
9. Calculate and interpret measures of central tendency (mean, median, mode) and measures of dispersion (range, variance, standard deviation, standard error and coefficient of variation).
10. Define the normal curve and use the information to solve community problems.
11. List different methods of data presentation (tables, graphs and diagrams) according to type of data.
12. Define hypothesis and its types, including level of significance and p value.
13. Apply the appropriate test of significance according to type of data and interpret it (Z-test, t-test and Chi-square test).
14. Define sampling and its various techniques.
15. Define Health Management Information System.
16. Define correlation and regression.
17. Interpret different values of the correlation coefficient (r).
18. Find the regression line for predicting the value of a dependent variable, given the value of the independent variable.
19. Describe the situations for application of correlation analysis and regression analysis.
20. Enumerate the steps of hypothesis testing.
21. Describe and name various tests of significance.
22. Define the coefficient of determination (R2).

Health Information and Biostatistics

Appropriate and timely information is the lifeline of an organizational set-up. However, medical students as well as graduates find it difficult to perceive the uses and interpretation of the language used in biostatistics.

Definition
Biostatistics is defined as the collection, organization, summarization, presentation, analysis and interpretation of health-related data.[1]

Application of Statistical Knowledge (Uses)

Following are the main functions of health information in the field of public health.

To define the nature and extent of illness in the community;

This is done through collection of reliable data relating to illnesses. This directs the attention of the health authorities to the existence of the problems and their extent. It allows comparison with other public health problems so that the necessary control and preventive programmes can be extended and funds allocated accordingly.

To find the causation of existence of the health problems;

After it has been defined that a particular problem exists in the community, it becomes necessary to find out the causation of that problem. Therefore, further information is collected from the community about its sickness to study the various factors inherent in the human population and the environment. Through analysis of the data it becomes possible to ascertain the cause of the prevalence of the disease. Thus statistics guide the health administrator in instituting proper health measures.

To assess the success or failure of the health measures;

After the programmes have been instituted it becomes necessary to evaluate them periodically to find out whether or not the funds allocated are properly spent. Therefore statistics act as a yardstick not only to measure the extent of the initial stages but also to measure the degree of achievement during the course of a programme. Its use has now been extended to determine the health status of the population.

From the above it is implied that we are basing our analysis of the problems of health and disease on a systems approach, now termed systems analysis. The present trend in the Ministries and Departments of Health and Planning is to use the systems analysis approach in health planning: defining the objectives which must be achieved to solve the problem, considering alternative methods for meeting these objectives, and choosing the most suitable in terms of cost analysis. Systems analysis may not be possible in all situations in the field of health, yet this approach to planning is better than one based on unsystematic judgment.

Background
It is customary for most medical students and graduates to ask why they should learn the addition, subtraction, division and multiplication of figures commonly regarded as biostatistics. As the need is not rationalized, they take no interest in the subject and consider it alien to the medical field[2]. However, when it is asked whether we need information for better planning and decision making, everyone agrees. The question then is what information is needed and how it should be presented. Take the example of a medical superintendent or a departmental head who is interested in improving the health status of the community by providing better services through the hospital. Two important questions need to be answered: (Q1) what information should he/she get? (Q2) in what shape should the information be presented, so that one can easily understand the problem and formulate strategies to solve it? Regarding the second question, presentation of data, one really finds oneself in great trouble. Consider the variable of age only. The collected information would run to several pages and, if the data are in completed years, would be in the following form: 2, 60, 5, 59, 3, 30, 15, 11, 6, 8 ... and so on. If whole sheets of such information are presented to anyone

he/she cannot make any meaning out of it, and if the requirement is to present the gist of the information in a single figure, it becomes even more difficult. Historically, this question was answered by giving the average value of the data. Take the example of the following data on the ages of children.

2, 3, 4, 4, 4, 5, 6   <Data I>   (4.1)

If we add all seven values and divide by the total number of values, we get a figure which represents the data as a whole, e.g. for Data I (4.1):

(2 + 3 + 4 + 4 + 4 + 5 + 6) / 7 = 28 / 7 = 4   (4.2)

This figure (commonly known as the average) gives us a number of messages: we are addressing a population of children; we may need an under-5 clinic, a child specialist, nurses and vaccines to solve the health issues of this age group; and if we plan for children of 4 years and slightly above and below that age, it would be considered good planning. This average, i.e. 4, is termed the Mean[3] of the data and is written as $\bar{X}$ for a sample and $\mu$ for a population. Now consider the situation when a person of 60 years is added to Data I. The data become:

2, 3, 4, 4, 4, 5, 6, 60   <Data II>   (4.3)

Calculate the mean of Data II:

(2 + 3 + 4 + 4 + 4 + 5 + 6 + 60) / 8 = 88 / 8 = 11   (4.4)

If we plan using this mean, i.e. 11, what would happen to our planning? We would be considering grown-up children entering adolescence, in need of secondary schools, playgrounds, behavioural and social scientists etc. It would be a disaster to plan on the mean in this situation, as none of the children is 11 years old. So here the mean is not a figure which should be used to draw conclusions about the data. After much discussion, scientists solved the problem by recommending that the central-most value be taken as representative, because it divides the data into two equal groups and is more representative in this situation. It was termed the Median[4]. Take Data I (4.1): 2, 3, 4, 4, 4, 5, 6. If we divide the data into two equal groups after arranging them in ascending or descending order, they appear like this:

Serial No.   1  2  3  |  4  |  5  6  7
Value        2  3  4  |  4  |  4  5  6      (4.5)

There are seven values in the data, three above the central value (4, 5, 6) and three below it (2, 3, 4). Hence the fourth value is the central-most value. The fourth value is 4, and this represents the central position of the data. The position of the central value is given by the following formula:

Position of median = (n + 1) / 2   (4.6)

Here n is the total number of values. For Data I the total number of values is seven, which is an odd number. Hence

Position of median = (n + 1) / 2 = (7 + 1) / 2 = 8 / 2 = 4   (4.7)

So again the 4th value is the median, which is 4 (see 4.5). The median is more reliable in such data because it is not distorted by extreme values.

Important Point
It is important to note that the median should be ascertained after arranging the data in ascending or descending order[5]. Data I is already arranged and we need not do this exercise. For Data II, the central value is found as follows:

Serial No.   1  2  3  (4  5)  6  7  8
Value        2  3  4  (4  4)  5  6  60      (4.8)

If we divide the data into two equal parts as shown in (4.8), nothing remains in the centre because the number of values is even. In this situation we take the two central values, i.e. serial Nos. 4 and 5, both of which have the value 4 (bracketed in 4.8), and take their average, (4 + 4) / 2 = 4. Again the median is 4. For Data II, the median is more representative of the data than the mean.

In many situations medical personnel are more interested in knowing the most frequently occurring value than the mean or median. Take the example of the frequency distribution of diseases in a population. For better planning we would like to know which disease is affecting the people most among TB, malaria, typhoid and diarrhoea; the most frequently occurring (most common) disease becomes our top priority. This most frequent value is labelled the Mode[6]. For Data I, the most frequently occurring value is 4, as shown in

2, 3, 4, 4, 4, 5, 6   (4.9)

Similarly for Data II it is again 4 which is repeated most often, as shown in 4.10:

2, 3, 4, 4, 4, 5, 6, 60   (the most fashionable of the series)   (4.10)

Summary
In short, the measure most commonly used by planners for decision making is the mean, but sometimes it does not represent the data, as seen in Data II. It is always good to find the median and mode as well, to decide about the shape of the data and for application of the principles of statistics. These three together are known as Measures of Central Tendency.[2] If all three, i.e. mean, median and mode (M, M, M), are the same, we can make important inferences about the data: the most frequent value, the average and the central-most value all lie at the centre of the data, e.g. 4 for Data I.

All this calculation is for the purpose of planning and decision making. Once we get M, M, M as 4, what planning should we do for this type of data and what decisions could be made? Here we can assume several things:
a. All the children are of the same age.
b. Most of the children are 4 years old, while some are 2, 3, 5 and 6 years old as well.
c. Most of the children are 4 years old, but some are aged anywhere from below 1 year up to 8 years or more.

For situation A, we would consider the needs of 4-year-old children only: a paediatrician, nurseries, booster doses of diphtheria, tetanus and polio vaccines, and other needs of 4-year-old children. For situation B, we would allocate most of the funds to the needs of 4-year-olds but would spend some money on the ages 2, 3, 5 and 6 years as well, with nothing or a minimal amount for infants (below 1 year), and would not buy vaccines like BCG, DPT, measles etc. For situation C, we would plan for each age group, and the total budget would be distributed among all age groups, including those below 1 year and above 8 years. Now how would we know which of A, B and C is the actual situation? Is there anything in the mean, median and mode which could help us decide? The answer is no.

After much thought it was decided that if we know the minimum and maximum values we can plan between them, and the difference between the maximum and minimum values was labelled the Range[7]. For Data I (4.1) the range is 6 - 2 = 4 (highest value - lowest value), and we would implement the plan for situation B. So the range gives us an idea for deciding the type of situation. Now consider the following data:

2, 40, 40, 40, 40, 40, 40, 40, 40, 40, 78   <Data III>   (4.11)

In this situation the mean, median and mode are all the same, i.e. 40, but the range is 78 - 2 = 76 years, so we would consider ages between 2 and 78 and design a management plan of situation C, while, as is evident from the data, management plan A would have been better. The range is a weak measure of spread because it takes account of only two values and gives no information about what lies between the minimum and the mean or between the mean and the maximum. In Data III no person exists between 2 and 40 years or between 40 and 78 years, yet by planning on the minimum and maximum we would be planning for all age groups between 2 and 78 years, resulting in wastage of resources. Here the range is HV - LV = R = 78 - 2 = 76, which is too wide to help us in planning. The range therefore does not answer the question of spread (scatter of data) in some situations.

So scientists kept looking for a figure which could give a better idea of the spread of data than the range. The deficiency of the range was that it accounts for only two values, the minimum (lowest) and the maximum (highest). It was decided that, to conclude about the spread around the mean, every value in the data should be considered. Hence we subtract the mean from each value. If we denote each value by $X_i$ and the mean by $\bar{X}$, the expression is

$X_i - \bar{X}$   (4.12)

We apply expression 4.12 to Data I. As the mean is 4 and the first value in Data I is 2, applying $X_i - \bar{X}$ we get 2 - 4 = -2. Similarly, for the values 3, 4, 5 and 6 we get -1, 0, +1 and +2 respectively. Subtracting the mean from each value gives the values shown in column 4 of Table 11-1. To get rid of the minus signs, we square these values, as shown in column 5 of Table 11-1. Hence we obtain the squared difference of each value from the mean.
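The averages and the range worked out above for Data I, II and III can be checked with a few lines of Python. This is only a minimal sketch using the standard-library `statistics` module; the variable names and the helper `summarize` are illustrative assumptions and are not part of the chapter.

```python
from statistics import mean, median, mode

# The three data sets used in the text (ages in completed years).
data_i = [2, 3, 4, 4, 4, 5, 6]
data_ii = data_i + [60]              # Data I plus one person aged 60
data_iii = [2] + [40] * 9 + [78]     # Data III

def summarize(values):
    """Return mean, median, mode and range for a list of values."""
    ordered = sorted(values)         # the median requires ordered data
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "mode": mode(ordered),
        "range": ordered[-1] - ordered[0],   # highest value - lowest value
    }

for name, values in [("Data I", data_i), ("Data II", data_ii), ("Data III", data_iii)]:
    print(name, summarize(values))
```

For Data II the mean moves to 11 while the median and mode stay at 4, and for Data III all three averages are 40 although the range is 76, which is the point made in the text.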

Table 11-1
Calculating the average deviation of each value from the mean for Data I (4.1)

Column 1   Column 2      Column 3            Column 4              Column 5
S. No.     Value $X_i$   $X_i - \bar{X}$     $(X_i - \bar{X})$     $(X_i - \bar{X})^2$
1          2             2 - 4               -2                    4
2          3             3 - 4               -1                    1
3          4             4 - 4                0                    0
4          4             4 - 4                0                    0
5          4             4 - 4                0                    0
6          5             5 - 4                1                    1
7          6             6 - 4                2                    4
Total                                                              10

We sum all the squared differences and then take the average of that sum, as shown in equation (4.13).

$V = \frac{\sum (X_i - \bar{X})^2}{N} = \frac{10}{7} = 1.43$   (4.13)

Here V = variance, $\bar{X}$ = mean, $\sum$ = sum, $X_i$ = each individual value, N = total number of values.

Each value of $X_i - \bar{X}$ is first obtained as shown in column 4 of Table 11-1, and then the sum is obtained by adding all the individual values of $(X_i - \bar{X})^2$ at the end of column 5. As we are interested in the average difference from the mean, this sum of squared differences is divided by the total number of values (N), and the value obtained is known as the variance (4.13). Remember that for planning and decision making we are interested in a figure which tells us the average difference of each value from the mean, but what we obtained in equation 4.13 is the average of the squared differences. To get back to the scale of the differences we reverse the process, i.e. we take the square root of the variance obtained in equation 4.13. This is the figure in which we have most interest, the most powerful tool in biostatistics, and it is termed the standard deviation[8] (S.D.).

$S.D. = \sqrt{V} = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}}$   (4.14)

Applying the values from Table 11-1:

$S.D. = \sqrt{V} = \sqrt{\frac{10}{7}} = \sqrt{1.43} = 1.2$   (4.15)

For the population, variance[9] is written as $\sigma^2$ while standard deviation is denoted by $\sigma$ (read as sigma)[8]. For sample values, variance is written as $S^2$ and standard deviation as $S$. However, for the calculation of the sample variance and standard deviation we divide the sum of the squared differences by the total number of values minus one (known as the degrees of freedom[10]) instead of the total number of values. Hence the formulas become

$S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$   (4.16)

$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$   (4.17)

Applying the formula to Data I, supposing the data represent a sample from the population:

$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{10}{7 - 1}} = \sqrt{\frac{10}{6}} = \sqrt{1.67} = 1.3$   (4.18)

We get a slightly higher standard deviation for the sample (S = 1.3) than for the population ($\sigma$ = 1.2). The question is why we divide by n - 1 for a sample as compared to N for a population. It is done in order to accommodate the variation due to sampling differences, and n - 1 is known as the degrees of freedom.[10] Suppose the total of certain values is fixed; for Data I (4.1) the total is fixed at 28, and there are seven values. The question is how many values we can assign of our own accord. The values will appear like this:

S. No.   1   2   3   4   5   6   7   Total
Value                                28        (4.19)

We can put any value at serial No. 1, e.g. 3. The same is true for serial Nos. 2, 3, 4, 5 and 6, which are allocated the values 4, 5, 4, 2 and 6 respectively. The data will look like this:

S. No.   1   2   3   4   5   6   7   Total
Value    3   4   5   4   2   6       28

Now can we assign any value of our own accord at S. No. 7? The answer is no. The only possible value at S. No. 7 is 4. We can assign any value from S. No. 1 to 6, but the 7th value is obtained by subtracting the sum of the six values from the sum of all values (28 - 24 = 4). So in this situation we have n - 1 = 7 - 1 = 6 degrees of freedom to assign values, but the 7th one is fixed. This value of n - 1 is known as the degrees of freedom (d.f.).

Coefficient of Variation (CV)

The standard deviation is useful as a measure of variation within a given set of data. It tells how much the data are spread around the mean. Consider two populations A and B which have standard deviations of 5 lb and 10 lb respectively. Apparently population B has more spread, but that cannot be concluded until we know the means. On inquiry it was found that population A has a mean weight of 10 lb with S.D. 5 lb, while population B has a mean weight of 200 lb with S.D. 10 lb. For population A, with a mean weight of 10 lb, a standard deviation of 5 lb shows a wide spread of data, while for population B, where the average weight is 200 lb, a difference of 10 lb is tolerable. Hence the standard deviation alone is not sufficient for comparing the spread of data unless we know the mean as well. This relationship of standard deviation to mean is known as the coefficient of variation (also written coefficient of variance)[1,11] (CV) and is calculated as follows:

$CV = \frac{S.D.}{\text{mean}} \times 100$   (4.20)

Here CV = coefficient of variation and S.D. = standard deviation. So for population A:

$CV = \frac{5}{10} \times 100 = 50\%$

while for population B:

$CV = \frac{10}{200} \times 100 = 5\%$

It can be seen that population A (CV = 50%) has more variability (spread) around the mean than population B (CV = 5%), while considering the standard deviation alone we reached the opposite conclusion. All these measures, i.e. range, variance, standard deviation and coefficient of variation, constitute the Measures of Spread.[2]

Normal Distribution
So far we have seen that by knowing the mean (arithmetic average) and standard deviation we have a good idea of the central position of the data and the spread of values around the mean, which gives us some idea about the population. Let us look at the curves in Figure 11-1. It is evident that Curves I, II and III all have the same mean but different spread. Curve I has the least spread of values around the mean and Curve III the maximum. These curves and their shapes are very helpful in planning and decision making. They use two statistics, the mean and the standard deviation, for important decisions in health sciences.

Figure 11-1
Three normal distributions (Curves I, II and III) with the same mean but different standard deviations

Suppose we collect blood sugar values of a very large number of people and make a frequency distribution with narrow class intervals; we are likely to get a smooth symmetrical curve. Such a curve is known as the normal curve or normal distribution curve[9,12]. The shape of the curve depends upon the mean and standard deviation, which in turn depend upon the nature and number of observations. While plotting frequency distributions, an interesting phenomenon was observed again and again and was therefore later used for decisions in planning. The observation was that the area under the curve from the mean to one standard deviation on the right side is 34.13%, i.e. +1 SD = 34.13%, with the same percentage of area on the left side from the mean to -1 SD, so the total area within ±1 SD is approximately 68.26%. Similarly, the area under the curve within ±2 SD is 95.45% and within ±3 SD is 99.73%, which means that 99.73% of values lie within ±3 standard deviations. This is very important information, and the distribution of area under the curve is shown in Figure 11-2.
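The measures of spread for Data I and the areas under the normal curve quoted above can be verified numerically. The sketch below is an illustration only: it assumes Python's standard `statistics` and `math` modules, and the helper names `coefficient_of_variation` and `area_within` are hypothetical, not taken from the chapter.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data_i = [2, 3, 4, 4, 4, 5, 6]

# Population formulas divide the sum of squared differences by N.
pop_var = pvariance(data_i)    # 10 / 7
pop_sd = pstdev(data_i)        # about 1.2
# Sample formulas divide by n - 1 (the degrees of freedom).
samp_var = variance(data_i)    # 10 / 6
samp_sd = stdev(data_i)        # about 1.3

def coefficient_of_variation(sd, mean_value):
    """CV = S.D. / mean x 100, expressed as a percentage."""
    return sd * 100 / mean_value

def area_within(k):
    """Fraction of a normal distribution lying within +/- k SD of the mean."""
    return math.erf(k / math.sqrt(2))

print(round(pop_var, 2), round(pop_sd, 1), round(samp_var, 2), round(samp_sd, 1))
print(coefficient_of_variation(5, 10), coefficient_of_variation(10, 200))
print([round(area_within(k) * 100, 2) for k in (1, 2, 3)])
```

The exact area within ±1 SD comes out at about 68.27%; the 68.26% in the text results from doubling the rounded figure of 34.13%.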

Figure 11-2
Distribution of area under the curve for various values of standard deviation (horizontal axis marked from $\bar{x}$ - 3SD to $\bar{x}$ + 3SD; 68.27% of the area lies within ±1 SD, 95.45% within ±2 SD and 99.73% within ±3 SD)

Standard Normal Curve

As we have seen, if data from a population are plotted on a graph, the curve can assume any shape from Curve I to Curve III as shown in Figure 11-1. Statisticians have designed a curve known as the standard normal curve for ease of calculating the area under the curve. This curve has a mean of zero and a standard deviation of 1, and the total area under the curve is one[10]. It is a smooth, bell-shaped, perfectly symmetrical curve based on an infinitely large number of observations. The mean, median and mode all have the same value. The distance of a value $X_i$ from the mean $\bar{X}$, measured in standard deviations, is called the relative deviate or standard normal variate; it is denoted by Z and calculated by the following formula[5]:

$Z = \frac{X_i - \bar{X}}{SD}$   (4.21)

A random variable $X_i$ is said to be standardized when its mean is 0 and its standard deviation is 1. The new variate Z also follows the normal distribution. The area under the standard normal curve has been calculated, and an extract of these values is given in Table 11-2. (The complete table may be found in any statistics textbook.)

Table 11-2
Distribution of area under the curve for various values of Z

Relative deviate (Z)   Proportion of area under curve from middle of the curve
0.00                   0.0000
0.5                    0.1915
1.00                   0.3413
1.5                    0.4332
2.0                    0.4772
3.0                    0.4987
4.0                    0.49997

Estimation of Probability
Suppose that the mean haemoglobin of a selected group is 12 gm with a standard deviation of 2 gm. What is the probability that a person picked at random will have a haemoglobin of 16 gm or more? What would be the total number of people above 16 gm if the total population is 10,000? The relative deviate is

$Z = \frac{X_i - \bar{X}}{SD} = \frac{16 - 12}{2} = \frac{4}{2} = 2$

The area under the curve corresponding to a deviate of Z = 2 is 0.4772 (Table 11-2). Since we are dealing with half of the curve, the area beyond Z = 2 is 0.5 - 0.4772 = 0.0228 = 228/10,000.

So there is a probability of only 0.0228 of finding individuals with 16 gm or higher in this population; in other words, 228 out of 10,000 individuals will have a haemoglobin of more than 16 gm. Similarly, suppose we want to know the number of anaemic people in a population of 10,000 when the cut-off value for anaemia is 10 gm, with mean haemoglobin 12 gm and S.D. = 2 gm:

$Z = \frac{X_i - \bar{X}}{SD} = \frac{10 - 12}{2} = \frac{-2}{2} = -1$

The area under the curve corresponding to a deviate of +1 or -1 is 0.3413. As we are interested in the population beyond this value, 0.3413 is subtracted from 0.5, as shown in Figure 11-3. The area under the curve (shaded area) is 0.5 - 0.3413 = 0.1587 = 1587/10,000.
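The two haemoglobin calculations above can be reproduced without a printed table. The following is a minimal sketch, assuming the variable is normally distributed; the function names `z_score` and `normal_cdf` are illustrative, and the area is obtained from the standard-library `math.erf` rather than from Table 11-2.

```python
import math

def z_score(x, mean, sd):
    """Relative deviate Z = (X_i - mean) / SD."""
    return (x - mean) / sd

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mean_hb, sd_hb, population = 12, 2, 10_000

z_high = z_score(16, mean_hb, sd_hb)   # haemoglobin of 16 gm or more
p_high = 1 - normal_cdf(z_high)        # area beyond +2 SD

z_low = z_score(10, mean_hb, sd_hb)    # anaemia cut-off of 10 gm
p_low = normal_cdf(z_low)              # area below -1 SD

print(z_high, round(p_high, 4), round(p_high * population))
print(z_low, round(p_low, 4), round(p_low * population))
```

The counts agree with the table-based working: about 228 people above 16 gm and about 1587 people below 10 gm in a population of 10,000.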

Figure 11-3
The area beyond 1 standard deviation below the mean (below 10 gm), shown as the shaded area; the horizontal axis is marked at 10, 12, 14 and 16 gm.

So 1587 people out of 10,000 will have a haemoglobin below 10 gm. Note that we have estimated the anaemic population just by knowing two values, the mean and the standard deviation, without actually conducting a survey of the population, which helps us in planning and decision making. Similarly we can calculate the number of hypertensive or diabetic people in a population if we know the mean and standard deviation and can safely assume that the variable is normally distributed, which is approximately true for many variables.

Types of Data
So far we have learned how to present data in summarized form using certain key measures like the mean and standard deviation. Now, if we have to present the data so as to convey the major information at a glance, how are we going to present it? Before presenting, analysing and interpreting information we need to know the types of data, because this concept is used not only to decide the shape of the data for presentation but also to analyse and interpret them. Different textbooks use different classifications, but we will use them together to give a holistic picture:
a) Qualitative (descriptive/categorical/frequency count) data
b) Quantitative (numerical/continuous) data

Qualitative Data
Data are qualitative when they are arranged in categories and there is a gap between two values. Qualitative data are further divided into two types.

Nominal data[13]

The categories are distinguished only by their names and labels and cannot be ranked one above another. Consider sex as a variable for classification: whether one places male as the first category and female as the second or vice versa, changing the order of the categories has no effect. The same is true for colours. One can write green, blue, red as categories I, II, III; it can equally be written as blue, red, green. Hence if the purpose of the categories is just naming, the data are labelled nominal. Other examples are blood group, race, name, sex, name of country, name of crops etc.

Ordinal data[14]

As the name suggests, when categorical data can be arranged in ascending or descending order on the basis of their quality, the data are known as ordinal data. Note that the exact difference between two groups cannot be estimated, e.g. grades in a class: A, B, C. We know that A is better than B, but by how much cannot be estimated. Student X scoring 70% will be placed in Grade A and student Y scoring 69% in Grade B. Similarly, student T scoring 100% will be placed in Grade A and student Z scoring 60% in Grade B. So a difference as small as 1% places one student in Grade A and the other in Grade B (X and Y), while a difference as large as 40% may exist between two students of Grades A and B (T and Z). Hence we cannot estimate the exact difference between two students placed in Grade A and Grade B; we can only safely say that Grade A is better than Grade B. Other examples are severity of pain (A severe, B moderate, C mild), tumour grades (I, II, III, IV) and types of hypertension.

Quantitative Data (Numerical/continuous)

When the interval between two values can be divided into an infinite number of intervals, the data are known as quantitative or continuous data, e.g. age (25.5 years), weight (70.5 kg), height (1.5 metres), haemoglobin (12.5 gm), blood pressure (130/90), blood sugar (120.0 mg/dl). Continuous data may be of ratio or interval type.

Ratio[15]

This is a scale with a true zero point. Here the measurements are on an actual scale, e.g. 4 metres is twice the length of 2 metres and 60 years is 6 times 10 years. So length and age are on a ratio scale if measured in metres and years respectively. Most variables in nature are examples of the ratio scale, e.g. age, weight, blood, temperature on the Kelvin scale, haemoglobin, urea, blood sugar etc., and zero means zero.

Interval[2]

The categories are arranged in equally spaced units and there is no absolute zero point, e.g. temperature, where 0°C does not mean no temperature but is equal to 32°F or 273 K (Kelvin scale). Here a value on the ratio scale is taken as the starting point for better comprehension of the data. Temperature is measured on the Kelvin scale; however, for practical purposes Celsius designed an arbitrary scale for daily use. He took 273 K as equal to 0°C, the starting point of the centigrade scale. Thus a normal body temperature of 310 K is easily read as 37°C. However, 20°C is not twice as hot as 10°C, because °C is an arbitrary scale: 20°C is actually equal to 293 K and 10°C to 283 K. Hence the change from 10°C to 20°C is only a difference of 10 K, from 283 K to 293 K, and not a doubling.
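The point that ratios are meaningful only on a scale with a true zero can be shown with the temperatures used above. This is a small illustrative sketch; it uses the chapter's rounded offset of 273 (the more precise value is 273.15), and the names are assumptions made for the example.

```python
KELVIN_OFFSET = 273  # rounded value used in the text; 273.15 is more precise

def celsius_to_kelvin(celsius):
    """Convert an interval-scale reading (Celsius) to the ratio scale (Kelvin)."""
    return celsius + KELVIN_OFFSET

ratio_celsius = 20 / 10                                        # misleading "twice as hot"
ratio_kelvin = celsius_to_kelvin(20) / celsius_to_kelvin(10)   # 293 / 283

print(celsius_to_kelvin(37), ratio_celsius, round(ratio_kelvin, 3))
```

On the Kelvin scale 20°C is only about 3.5% hotter than 10°C, not 100% hotter.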

9
D If percentages or averages are to be compared, they should be placed as close as possible. E No table should be too large. F Most people nd a vertical arrangement better than a horizontal one because, it is easier to scan the data from top to bottom than from left to right. G Foot notes may be given, where necessary, providing explanatory notes or additional information.

Transformation of Data
The continuous data can be transformed into discrete e.g. birth weight of infants may be categorized into: I Low < 2500 gm II Normal > 2500 < 4000 gm III Overweight > 4000 gm

Simple Table
When characteristic with values are presented in the form of table it is known as simple table. See Table 11-2.

Presentation of Data
Data once collected should be presented in a scientic style which depends, of course, on type of data. Data can be presented in many shapes, tables, charts, graphs, diagrams, pictures and special curves. Here we would discuss some of the important ones.

Frequency Distribution Table


In a frequency distribution table16, the data is rst split up into convenient groups (these groups are calculated by a certain formula so that they should not either be too large or too little yet sufcient enough to give message preferably in between 5 to 10 groups ) and the frequency of each group is shown in the adjacent columns. Following are the systolic blood pressure of patients coming to a tertiary care hospital OPD. 144, 146, 120, 100, 160, 130, 140, 150, 100, 110, 90, 95, 150, 160, 90, 180, 100, 150, 110, 130, 135, 138, 144, 95, 122, 136, 140,121, 130, 170, 150, 158, 102, 110, 116, 122, 180, 130, 135, 124, 122, 90, 146, 124, 126, 122, 130, 170, 122, 98, 122, 124, 170, 145, 130, 144, 150, 135, 160, 170 The data given above may conveniently be analyzed in the form of a tally sheet and later on in the shape of a frequency distribution table A frequency distribution table not only shows the frequency of each category but also the relative proportion of each category along with cumulative relative frequency as well to give idea of proportion below a certain level. (Table 11-4)

Tabulation
Tables are devices that are used to present the data in simple form and probably the rst and most commonly used method for presenting information. Though it is easy to use but for a reader it is difcult to grasp the concept, at a glance, specially for large tables. Following points should be given due importance while Tables are being used as the tool for presentation. Tabulation is the rst step before the data is used for analysis or interpretation. The following principles should be borne in mind in designing tables: A A title must be given to each table. The title must be brief and self-explanatory. B The headings of columns or rows should be clear and concise. C The data must be presented according to size or importance, chronologically, alphabetically or geographically.

Table 11-4
Distribution of frequency of blood pressure of patients coming to a Tertiary Care Hospital OPD.

Distribution    Frequency    Relative    Cumulative Relative
Below 100       6            0.10        0.10
100-120         9            0.15        0.25
121-140         24           0.40        0.65
141-160         15           0.25        0.90
Above 160       6            0.10        1.00

Table 11-2
Infant mortality rate of selected countries

Name of country    Infant mortality rate
Pakistan           90
Bangladesh         60
Sri Lanka          26
India              60
Afghanistan        200

Charts and Diagrams
They are powerful tools for the presentation of data: they give information at a glance and are better retained in memory than tables. However, although we get information at a glance, we have to compromise on the details of the data. Some of the more common diagrams are as follows:
• Pie chart
• Simple bar diagram
• Multiple bar diagram
• Component bar diagram or subdivided bar diagram
• Histogram
• Frequency polygon
• Frequency curve
• Ogive curve
• Scatter diagram
• Line diagram
• Pictogram
• Statistical map
• Venn diagram

Table 11-5
Distribution of under 5 mortality by the cause of death during 2004 in Pakistan.

Cause of Death                                Frequency    Allocated angles
Diarrhoea                                     1000         109
Acute Respiratory Tract Infections (ARI)      1200         131
Vaccine preventable diseases                  500          55
Other                                         600          65
Total                                         3300         360

$$\text{Proportion of ARI related deaths} = \frac{1200}{3300}$$

$$\text{Angle for ARI related deaths} = \frac{1200}{3300} \times 360^\circ \approx 131^\circ$$

Presentation of Categorical Data


The pie chart is the most common way of presenting data and is very popular among laymen [17]. The value of each category is divided by the total and multiplied by 360, and each category is then allocated the resulting angle to represent its proportion (see Figure 11-4). As an example, see Table 11-5. Calculation of the angle for ARI: frequency of ARI related deaths = 1200; total deaths = 3300.

Similarly, angles are calculated for the other diseases and then plotted in the shape of a circle. If there are 4 to 5 categories, the pie chart is a suitable diagram for presenting the information. It is clear from its shape that this type of diagram is used to present categorical data. However, with six or more categories it is no longer effective.
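As a quick check of the angle calculation, the same rule (frequency divided by total, multiplied by 360) can be applied to every cause of death in Table 11-5. The function name `pie_angles` is an assumption made for this sketch; the frequencies are those of the table.

```python
def pie_angles(frequencies):
    """Allocate each category its share of the 360 degrees of a circle."""
    total = sum(frequencies.values())
    return {
        category: round(count / total * 360)
        for category, count in frequencies.items()
    }


# Frequencies from Table 11-5 (under 5 mortality by cause of death).
deaths = {
    "Diarrhoea": 1000,
    "ARI": 1200,
    "Vaccine preventable diseases": 500,
    "Other": 600,
}

angles = pie_angles(deaths)
for cause, angle in angles.items():
    print(f"{cause}: {angle} degrees")
```

Rounding each angle separately happens to add up to exactly 360 here; with other data the rounded angles may be a degree out, which does not matter for drawing.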

Pie Charts

If there are many categories (more than 5), the pie chart does not give the message at a glance. We can then use rectangles (bars) to present the information. The height of each rectangle represents the frequency or magnitude. The bars are separated by appropriate spaces, clearly visible to the naked eye, and a suitable scale is drawn for the bars. The gaps indicate that the data presented are categorical. The bars may be horizontal or vertical (Figure 11-5).

Simple Bar diagram

Figure 11-4
Distribution of under 5 mortality by the cause of death during 2004 in Pakistan. [Pie chart; sectors: Acute Respiratory Tract Infections (ARI), Diarrhoea, Vaccine preventable diseases, Other]

Figure 11-5
Year wise enrollment of students in Oriental Scholars [Simple bar diagram; vertical axis: No. of Students; horizontal axis: years One to Nine]

Figure 11-6
Distribution of students of various schools in Karachi by grade in 2005 [Multiple bar diagram; bars for Grade A, Grade B and Grade C in each school]

Figure 11-7
Distribution of students of various schools in Karachi by grade in 2004 [Component bar diagram; vertical axis: distribution of grades, 0% to 100%; components: Grade A, Grade B, Grade C]

Note that nine years of data are plotted on the same graph, yet each bar gives a clear picture of that year's enrolment, showing the highest and lowest enrolments at a glance.

vertical axis. The area of each block or rectangle is proportional to the frequency. Figure 11-8 is the histogram of the frequency distribution of scores of anxiety level of students appearing in FCPS part II examination. Figure 11-8

When each category of data is further subdivided into two or more categories, the simple bar diagram can no longer be used to present the data. For example, if you are comparing the grade A, B and C students of various schools, the multiple bar diagram is a useful option (Figure 11-6). Multiple bars are used to present ordinal data as well, and also when one variable is cross-tabulated against another [2].

Multiple Bar Diagrams

A frequency distribution may also be represented diagrammatically by the frequency polygon. It is obtained by joining the mid-points of distribution readings. Figure 11-9

Frequency Polygon

Line diagrams are used to show the trend of events with the passage of time. The following is an example of a line

Line Diagram

When there are many categories on the X axis (more than 5) and they have further sub-categories, the multiple bar diagram cannot be used because all the categories cannot be presented on the same graph. In order to accommodate them, the sub-categories that would be drawn as separate bars in a multiple bar diagram (e.g. grades A, B and C in a class) are instead stacked within the same rectangle, as shown in Figure 11-7. Though we lose the message at a glance, we accommodate more categories in the same graph. All the above examples are of categorical data presentation used in different situations.

Component Bar or Subdivided Bar Diagram

[Figure 11-8 plot annotations: Std. Dev = 5.17, Mean = 10.0, N = 48; horizontal axis from 0.0 to 20.0]

Presentation of Continuous Data


It is used to present variables which have no gaps, e.g. age, weight, height, blood pressure, blood sugar etc. [18]. It consists of a series of blocks (Figure 11-8). The class intervals are given along the horizontal axis and the frequencies along the

Histogram

Score

Figure 11-8
Histogram of the frequency distribution of scores of anxiety level of students appearing in FCPS II examination.

Figure 11-9
Frequency distribution of readings of systolic blood pressure [Frequency polygon; horizontal axis from 90 to 250; vertical axis from 0 to 30]

Figure 11-10
Malaria Cases Reported, 1971 - 1978 (excluding African Region) [Line diagram; lines for World, South East Asia and Other Region; horizontal axis: years 1972 to 78]

diagram. (Figure 11-10) showing the trend of malaria cases reported throughout the world (excluding the African Region) during 197272.

Pictograms are a popular method of presenting data to the man in the street and to those who cannot understand orthodox charts [8]. Small pictures or symbols are used to present the data, for example a picture of a child to represent under 5 child mortality (Figure 11-11). Fractions of the picture can be used to represent numbers smaller than the value of a whole symbol. In essence, pictograms are a form of bar chart.

Pictogram

the intakes of fat and sugar in the average diets of 41 countries. Populations with more income are shown to consume more protein, fat and also sugar [10]. If the dots cluster round a straight line, it shows evidence of a relationship of a linear nature. If there is no such cluster, it is probable that there is no relationship between the variables (Figure 11-13).

Ogive curve
It is used to present the accumulated information. With the help of this graph, the population at a particular point can easily be calculated. It shows cumulative relative frequency. Figure 11-14.

When statistical data refer to geographic or administrative areas, they are presented either as statistical maps or as dot maps, according to suitability. Shaded maps are used to present data of varying size [8]. The areas are shaded with different colours or different intensities of the same colour, as indicated in the key (Figure 11-12).

Statistical Maps

Generalization of Result of a Sample over the Whole Population


Standard Error
Once we calculate the mean of a sample we are interested

Scatter diagram shows the relationship between two variables e.g. Figure 11-13 shows a positive correlation between

Scatter Diagram

Figure 11-11
Under 5 mortality by countries 2004 [Pictogram; countries shown: USA, Sri Lanka, China, Bangladesh, Pakistan]

Figure 11-12
Rocky Mountain spotted fever: counties reporting cases, United States, 1993. (From Centers for Disease Control & Prevention: Summary of notifiable diseases, United States: 1993 MMWR 42.48. 1994.)

Figure 11-13
Relation between average fat intake and sugar intake [Scatter diagram; horizontal axis: Average Food Intake; vertical axis: Sugar Intake]

Figure 11-14
Weight (Cumulative Frequency Polygon, OGIVE) [Horizontal axis: Weight, from 100.5 to 150.5; vertical axis: 0 to 100]

in knowing the mean of the population as well. Suppose we have taken a sample of 100 women for their haemoglobin level and the mean obtained is 12 gm. Now the question is: what is the population mean? The answer depends on the standard deviation in the population. If there is no spread in the data, the standard deviation will be 0 and the population mean will be the same as the sample mean. If the standard deviation is small, the sample mean will be very close to the population mean, and if it is large, the sample mean may be far from the population mean.
For example, suppose there are 50 individuals in a population and they all have the same haemoglobin level, i.e. 14 gm. As there is no variability, the standard deviation will be zero. If we take a sample of two persons from this population, their mean Hb will be (14 + 14)/2 = 14 gm (as each individual has an Hb level of 14 gm), and the sample mean will exactly equal the population mean.
Now suppose we have 50 individuals with the same mean Hb level of 14 gm, but some individuals have Hb levels of 12, 13, 15 and 16 gm. As there is variability in the data, the mean of a sample of two persons from this population may vary as well. However, if we increase the sample size to about 10, the sample mean may be close to the population mean, as the increase in variability is accommodated by the increase in sample size.
Now suppose that, with the same population mean of 14 gm for 50 individuals, a few people have Hb levels between 6 gm and 18 gm. Will a sample of 10 individuals still give the same sample mean? Obviously not. Due to the increased variability of the data, the sample mean may be far from the population mean. However, if we increase the sample size to about 30, it may accommodate the variability of the data and the sample mean may again be close to the population mean. Note the relationship of the sample mean to the population mean.
If there is no variability (standard deviation = 0), a very small sample will give a mean equal to the population mean.

If there is moderate variability (standard deviation), a moderate sample size will give a sample mean close to the population mean. If there is large variability (S.D.) in the data, the same moderate sample size will not be sufficient, and we will have to take a large sample to accommodate the variability and get a sample mean close to the population mean. Hence, estimating the population mean from the sample mean depends on two factors:
1. standard deviation
2. sample size
This relationship of standard deviation to sample size is known as the Standard Error [2] (SE). The Standard Error tells us how far our sample mean is likely to be from the population mean. Naturally we would like to keep the standard error as small as possible. If the standard deviation is small, a small sample size is sufficient to estimate the population mean, but if the standard deviation is large, we increase the sample size to keep the SE small by neutralizing the effect of the S.D. As the S.E. is directly proportional to the S.D. and inversely proportional to the square root of the sample size, the formula for the Standard Error is

$$SE = \frac{SD}{\sqrt{N}} \tag{4.22}$$

Here SE = Standard Error, SD = Standard Deviation and N = Sample size.

Confidence Interval
Suppose we want to know the population mean (μ) from the sample mean (X̄). Can we estimate it? We cannot determine the population mean exactly with the help of a sample. However, we can construct an interval within which the population mean (μ) may lie. This interval is known as the confidence interval [19]. Now the question is: are we one hundred percent confident about the population mean? As the results are sample based, we cannot be 100% sure, though we would like to be. The estimate of μ from the sample mean depends on the relationship of S.D. and sample size (known as SE), so the formula for the population mean will be

$$\mu = \bar{X} \pm 2\,SE \tag{4.23}$$

mean (μ) would be between 10.66 gm and 13.33 gm. If you look closely at this confidence interval, the range has become too wide, i.e. from 10.66 to 13.33. It means that the population mean may have any value from close to anaemic (10.66) to that of a healthy population (13.33), which does not convey any message. Someone may argue: if at the end of the study you are giving such a wide range, what was the purpose of the study, when we cannot decide whether the population is anaemic or healthy? Take another example where, with the same mean (X̄ = 12) and sample size (N = 9), the standard deviation is 4 gm. What would the confidence interval about the mean be now?

$$\mu_{CI} = \bar{X} \pm 2\,SE = \bar{X} \pm 2\,\frac{SD}{\sqrt{N}} \tag{4.28}$$

As we would construct an interval for μ, the value would be in both the + and − directions. We have seen that the 95% confidence interval for values based on the standard deviation is approximately 2 standard deviations (1.96 to be more specific). The same is true for the sampling distribution of the mean. For example, if we obtained a mean of 12 gm from a sample of 100 women with a standard deviation of 2, what would be the confidence interval (CI) for the population mean?

$$\mu_{CI} = \bar{X} \pm 2\,SE = \bar{X} \pm 2\,\frac{SD}{\sqrt{N}} \tag{4.24}$$

$$\mu_{CI} = 12 \pm \left(2 \times \frac{4}{\sqrt{9}}\right) = 12 \pm \frac{8}{3} = 12 \pm 2.66, \text{ i.e. } 9.33 \text{ to } 14.66 \tag{4.29}$$

Interpretation

By placing the values in the equation

$$\mu_{CI} = 12 \pm \left(2 \times \frac{2}{\sqrt{100}}\right) = 12 \pm \frac{4}{10} = 12 \pm 0.4, \text{ i.e. } 11.6 \text{ to } 12.4 \tag{4.25}$$

We would interpret this as being 95% confident that the population mean haemoglobin lies between 9.33 and 14.66 (a much wider range than before). In simple words, the population mean could be that of an anaemic population (9.33 gm) or of a very healthy population (14.66 gm): a meaningless interpretation, with a very wide confidence interval for the population mean. Now, with the same mean (X̄ = 12 gm) and sample size (N = 9), if the standard deviation is 0.5 gm, what would be our confidence about the population mean?

$$\mu_{CI} = \bar{X} \pm 2\,SE = \bar{X} \pm 2\,\frac{SD}{\sqrt{N}} \tag{4.30}$$

$$\mu_{CI} = 12 \pm 0.4$$

We can say with 95% confidence that the population mean will be between 11.6 gm and 12.4 gm, i.e. the population is healthy.

Interpretation

Role Of Sample Size And Standard Deviation On Generalization Of Result


Suppose we had another sample of 9 individuals (N = 9) with a standard deviation of 2 (S = 2) and a mean of 12 (X̄ = 12). What would the confidence interval be now?

$$\mu_{CI} = \bar{X} \pm 2\,SE = \bar{X} \pm 2\,\frac{SD}{\sqrt{N}} \tag{4.26}$$

$$\mu_{CI} = 12 \pm \left(2 \times \frac{0.5}{\sqrt{9}}\right) = 12 \pm \frac{1}{3} = 12 \pm 0.33, \text{ i.e. } 11.66 \text{ to } 12.33 \tag{4.31}$$

$$\mu_{CI} = 12 \pm \left(2 \times \frac{2}{\sqrt{9}}\right) = 12 \pm \frac{4}{3} = 12 \pm 1.33, \text{ i.e. } 10.66 \text{ to } 13.33 \tag{4.27}$$

Note that in this situation, even with a small sample size of 9, we have a very narrow range of 11.66 to 12.33, even narrower than what we obtained when the sample size was 100 but the standard deviation was 2 (see 4.25). So in this situation a sample size of even 9 is adequate.
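The four worked examples can be compared side by side with a short Python sketch. The function names `standard_error` and `confidence_interval` are assumptions made for this illustration, and the multiplier of 2 is the text's approximation of 1.96.

```python
import math


def standard_error(sd, n):
    """SE = SD / sqrt(N), as in formula (4.22)."""
    return sd / math.sqrt(n)


def confidence_interval(mean, sd, n, multiplier=2.0):
    """Approximate 95% interval: mean plus or minus 2 SE."""
    margin = multiplier * standard_error(sd, n)
    return mean - margin, mean + margin


# (standard deviation, sample size) pairs used in the worked examples;
# the sample mean is 12 gm in every case.
examples = [(2, 100), (2, 9), (4, 9), (0.5, 9)]

for sd, n in examples:
    low, high = confidence_interval(12, sd, n)
    print(f"SD = {sd}, N = {n}: {low:.2f} to {high:.2f} (width {high - low:.2f})")
```

Because the program rounds rather than truncates, some limits may differ from the hand calculations in the last decimal place. The widths make the point of the section: the interval narrows when the standard deviation falls or the sample size rises.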

Interpretation

We can say with 95% confidence that the population

The confidence interval about the population mean (μ) based on the sample mean (X̄) depends on the standard deviation in the population, which cannot be controlled by the researcher as it is inherent in the population. However, to neutralize the effect of the standard deviation, the researcher may increase or

Conclusion

decrease the sample size in order to get a narrow range for the confidence interval. So using the sample size of some other study (where the standard deviation may well differ from that in our setting) will not be appropriate: it might either be too small to give a narrow range and an interpretable result, or too large, wasting resources. In situation (b) again there are two possibilities: i)

We fail to reject H0 and, in the true situation, it should not have been rejected. Hence we were right in doing so and the decision was correct. ii) We failed to reject H0 while it should have been rejected, so we were wrong in this case. Hence, while dealing with H0, we can make both correct and incorrect decisions, based on the evidence available to us compared with the true situation in the universe. To summarize this we can use a 2 x 2 table in which the true situation (diseased, not diseased) is kept as the column heading and the researcher's decision (test, procedure, examination or method, which may or may not be correct) is kept as the row heading, as a rule. See Table 11-6.
Let us take the example of a court deciding whether a person is guilty or not guilty. The true situation is known to God only, but the judge decides on the basis of evidence provided by the prosecution. H0 in this situation will be that the person is innocent like other people in the population (not guilty). The alternate hypothesis will be that the person is guilty. The court has only two options: (a) to decide that the person is guilty, or (b) to declare the person not guilty and exonerate him. For both decisions the situation is as shown in Table 11-7. If court decision (a) matches the true situation, the decision is correct; otherwise the court has committed a Type I error. If decision (b) matches the true situation, the decision is correct; if not, the court has committed a Type II error.
It is helpful to remember that in most situations the first desire of the researcher is to reject H0, and if he does so there is a possibility of committing the first error, which is the Type I error. In other words, we can define Type I error as rejecting H0 when it is actually true, or falsely rejecting H0. Similarly, accepting H0 when it is actually false is known as Type II error. We always make decisions in research on the basis of the available evidence.
Someone may ask whether there is a chance of error in our decision, or in other words, whether we are 100% sure. We answer that yes, there is a chance of error: how can we be sure about the whole population from the result of a sample? The next question would be: how large is the chance of error in the decision? Logically we would like to keep this chance as small as possible. Researchers throughout the world agree that

Hypothesis, Level of Significance and P Value


Any statement about a population is termed a hypothesis [1]. There are two types of hypothesis: the Null Hypothesis (H0) and the Alternate Hypothesis (HA). The Null Hypothesis [2] claims that there is no relationship (difference) between the two groups being compared, and is denoted by H0. For example, if we want to know about the complication rate of two surgical procedures for hysterectomy, Abdominal Hysterectomy (AH) and Vaginal Hysterectomy (VH), the null hypothesis will claim that there is no difference between the two procedures with respect to post-operative complications. The Alternative Hypothesis is generally the research question framed by the researcher, and it may be one of the following:
a) There is a difference in complications between AH and VH (two tail, or non-directional).
b) VH has fewer complications than AH (one tail, or directional).
c) AH has fewer complications than VH (one tail, or directional).
It is evident from the definition that H0 is the hypothesis which is usually against the claim of the researcher, and every researcher would like to reject it. When we try to reject H0 there are only two possibilities:
a) We reject the null hypothesis.
b) We fail to reject H0.
In situation (a) again there are two possibilities: i) We rejected H0 and we were right in doing so. ii) We rejected H0 and we were wrong.

Table 11-6

                                          True Situation
Decision based on test procedure / exam   Difference exists           Difference does not exist
Difference exists                         Correct                     Incorrect (Type I Error)
Difference does not exist                 Incorrect (Type II Error)   Correct

Table 11-7

                    True Situation
Court Decision      Guilty                    Not Guilty
Guilty              Correct                   Alpha Error or Type I Error
Not Guilty          Beta or Type II Error     Correct

a chance of error up to a maximum of 5% (0.05 in terms of probability) is a reasonable one in making a decision about H0 (though in some special situations we may allow an error of up to 10%, and in certain situations, when we want to be very sure, an error of only 1% (0.01)). This chance of error, which is set before starting the experiment and is usually fixed at 5%, is known as the level of significance or alpha (α). Here we would like to define the p value. It is loosely described as:
a) Probability of committing Type I error (Type I error has already been defined as rejecting H0 when it is actually true), or
b) Probability of rejecting H0 when it is true, or
c) Probability of falsely rejecting H0, or
d) Probability of getting a result by chance.
More precisely, the p value is the probability of obtaining a result at least as extreme as the one observed if H0 were true. Note that the level of significance (α) and the p value are described in similar terms, except that α is set before starting the experiment and is fixed at 5% (sometimes 1% or 10%), whereas the p value is calculated after completing the experiment; it is not fixed and can assume any value between 0 and 1, such as 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.5, 0.0001 or 0.00001. The obtained p value is compared with the level of significance (α) to decide about H0. If the p value is less than α (e.g. 0.03, 0.04, 0.01, 0.001 or 0.0001), we will reject H0, and if it is more than α (e.g. 0.06, 0.07, 0.08, 0.1, 0.2 or 0.5), we will not reject H0.
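The decision rule just described can be written as a small Python function. The name `decide` is an assumption made for this sketch, and so is the handling of a p value exactly equal to α (treated here as not rejecting H0), since the text only covers values below and above α.

```python
def decide(p_value, alpha=0.05):
    """Compare an obtained p value with the level of significance."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p value must lie between 0 and 1")
    # Assumption: a p value equal to alpha is treated as not significant.
    return "reject H0" if p_value < alpha else "fail to reject H0"


for p in (0.0001, 0.03, 0.06, 0.5):
    print(p, decide(p))
```

Passing `alpha=0.01` or `alpha=0.10` gives the stricter or more lenient levels mentioned in the text.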

for different types of tests of significance. The test to be applied depends on the type of data we are analyzing. We will see later which test of significance should be applied in which situation. Some of the tests of significance are:
1) Standard error for mean
2) Standard error for proportion
3) Standard error of difference between means
4) Standard error for difference between proportions
5) Chi square test
6) Z test
7) Student t test
8) Analysis of variance (ANOVA)
9) Correlation & Regression

Standard Error Of Mean


We have already discussed the standard error in detail. We know that the standard error depends on the standard deviation and the sample size. SE is also used to decide about the difference between two groups. If a mean value lies within ±2SE, we say that it is from the same population, and if the value lies outside ±2SE, we say that this mean is not from the same population but from some other population.

Example
Let us suppose we obtained a random sample of 100 individuals with a mean haemoglobin (Hb) level of 12 gm and a standard deviation of 2 gm. a) What possible range of mean Hb level could we expect within the 95% confidence limits? b) What would be your conclusion about the sample if the population mean is 11.2 gm? As we know, the formula for the confidence limits is (see 4.24)

$$\mu_{CI} = \bar{X} \pm 2\,SE$$

Test of Significance
Now we are heading towards decision making in research. We have decided that we will reject H0 if the p value is less than 5% or 0.05, but where is this p value obtained from? The answer is: by applying a test of significance. The basic principle behind all tests of significance is that they calculate the difference between groups. The larger the difference, the more confident we are in saying that there is a real difference. When we compare two or more groups, the possibilities are: (a) no difference, or (b) some difference. In case (a), we do not reject H0. In case (b), there are two possibilities: i. there is a slight difference due to chance alone, or ii. there is a large difference, which cannot be explained by chance alone, so we say that there is a real difference and reject H0. Now the question is what counts as a small difference and what counts as a large one. Scientists have set certain cut off values which are used to decide about the difference, and they are different

or

$$\mu_{CI} = \bar{X} \pm 2\,\frac{SD}{\sqrt{N}}$$

$$\mu_{CI} = 12 \pm \left(2 \times \frac{2}{\sqrt{100}}\right) = 12 \pm \frac{4}{10} = 12 \pm 0.4$$

μCI = 12 ± 0.4, or 11.60 to 12.40. So we can say with 95% confidence that the sample was drawn from a population whose mean lies between 11.6 gm and 12.4 gm. As the population mean of 11.2 gm is well outside the confidence limits of the sample, we can conclude that this sample was not from the population whose mean was 11.2 gm. The null hypothesis will be rejected and the difference will be stated as statistically significant. This whole idea can be made clear with the help of following

Let us suppose that in a population of 10000 the working people are 5200. A random sample of 100 individuals was taken and the proportion of working people was 0.40 or 40%. a) What possible range of working people would we expect in a sample of 100 with 95% confidence limits? b) What conclusion can be drawn from the sample? Note that the data are discrete and we are dealing with a proportion, not a mean. Hence the formula of S.E. for the mean is not applicable; we use the standard error of proportion in this case. When there are only two possible outcomes (success or failure, black or white, live or dead), also called mutually exclusive events, one outcome is labelled p and the other q. In this case p, the probability of being a worker, is 5200/10000 = 0.52 or 52%; hence q, the probability of not being a worker, is (1 − p) = (100 − 52) = 48%, since the total probability is always 1 or 100%. The formula for the S.E. of a proportion is

$$SE = \sqrt{\frac{pq}{n}} = \sqrt{\frac{52 \times 48}{100}} \approx \sqrt{25} = 5 \tag{4.33}$$

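Figure 11-15
Distribution of sample mean with a sample size of 100 [Normal curve centred on X̄ = 12 gm; horizontal axis marked −3SE = 11.4, −2SE = 11.6, −1SE = 11.8, +1SE = 12.2, +2SE = 12.4; the central 95% of the area lies within ±2SE]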

Figure 11-15. Remember that here we use S.E. instead of S.D. to show the sampling distribution, because the effect of S.D. has been neutralized by a good sample size. We have already decided that anything which falls within the ±2SE (95%) confidence limits will be taken as coming from the same population, and anything outside them as coming from a different population. The diagram shows that a mean of 11.2 is well outside the confidence limits. Hence the difference is statistically significant. In other words, there is only a very small chance (less than 5%) that a sample mean of 12 gm would come from a population whose mean is 11.2 gm, and when the chance is less than 5% we reject the chance explanation and conclude that the sample mean is different from the stated population mean.

We take 2 SE on either side of the proportion as our cut off point. Hence the formula would be

$$P_{CI} = p \pm 2\,(SE) = 52 \pm 2\,(5) = 42\% \text{ to } 62\%$$

Here PCI is the confidence interval and p is the proportion around which it is built (here the population proportion of 52%). We can say with 95% confidence that the proportion of working people in samples from this population lies between 42% and 62%. As the sample proportion obtained is 40%, it is outside these confidence limits. This sample is therefore from some other population and not from the population whose working force proportion is 0.52 or 52%. Hence we conclude that the difference is statistically significant.
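The working-population example can be reproduced with a short Python sketch. The function name `se_proportion` is an assumption made for this illustration; percentages are used throughout, as in the text.

```python
import math


def se_proportion(p_percent, n):
    """Standard error of a proportion, SE = sqrt(p * q / n), in percent."""
    q_percent = 100 - p_percent
    return math.sqrt(p_percent * q_percent / n)


population_p = 5200 / 10000 * 100   # 52% of the population are working
sample_size = 100
sample_p = 40                       # 40% of the sample are working

se = se_proportion(population_p, sample_size)
low = population_p - 2 * se
high = population_p + 2 * se
outside = not (low <= sample_p <= high)

print(f"SE = {se:.2f}%, limits = {low:.1f}% to {high:.1f}%")
print("sample proportion outside the limits:", outside)
```

The exact standard error is slightly below 5 because 52 × 48 / 100 is 24.96 rather than 25; the conclusion is unaffected.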

Relative Deviate
We can determine the difference by using the formula of the relative deviate as well. As we know,

$$Z = \frac{\bar{X} - \mu}{SE} = \frac{12 - 11.2}{0.2} = \frac{0.8}{0.2} = 4.0 \tag{4.32}$$

A relative deviate of 2 is taken as the cut off value for a significant difference. As the relative deviate is more than 2, we conclude that the difference is statistically significant. Note that we are using the standard deviation and mean, and the data are numerical (continuous); hence the test used is the S.E. of mean.
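The relative deviate for the haemoglobin example can be checked with a few lines of Python. The function name `relative_deviate` is an assumption made for this sketch; the cut off of 2 is the one used in the text.

```python
import math


def relative_deviate(sample_mean, population_mean, sd, n):
    """Z = (sample mean - population mean) / SE, with SE = SD / sqrt(N)."""
    se = sd / math.sqrt(n)
    return (sample_mean - population_mean) / se


z = relative_deviate(sample_mean=12, population_mean=11.2, sd=2, n=100)
significant = abs(z) > 2   # cut off value used in the text

print(f"Z = {z:.1f}, statistically significant: {significant}")
```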

Standard Error of Difference between Two Proportions


When we are confronted with a situation where we have to compare proportions between two groups instead of means, we use the standard error of difference between two proportions to decide whether the difference is significant.

Example
As a medical superintendent, we met two dealers who prepare vaccines A and B for the prevention of measles, and both claim that their vaccines give excellent results. The vaccines were given to a large number of individuals: 22 out of the 90 vaccinated with A and exposed to infection developed the infection, while 14 out of the 86 vaccinated with B and exposed to infection developed the infection. From the given

Standard Error of Proportion

scenario it seems that vaccine B is better than A. Now we will see whether this difference is due to chance alone or is statistically significant. We calculate the probability of infection (p) for vaccine A, which is the number of infections divided by the total exposed:

$$p_1 = \frac{22}{90} = 0.244, \text{ or } 24.4\% \text{ if expressed in percentages} \tag{4.34}$$

that the difference is large enough to be called significant or not, but we cannot tell how large the difference is. To calculate the exact difference and to find the p value we apply the chi square test, the Z test or Student's t test, and the choice of test depends on the type of data [2]. For qualitative data, where proportions are being compared, the test applied is the chi square test. Suppose we have two vaccines, A and B, for the prevention of measles, and we have to decide which vaccine is more effective for inclusion in the National Program. We gave vaccine A to 100 children and 20 of them later developed the infection. We gave vaccine B to another 100 children and 15 developed the infection. Apparently vaccine B is better than A. Now we have to see whether this difference is statistically significant or due to chance alone. We use a 2 x 2 table to compute the result (Table 11-8). If the null hypothesis is true, there should be no difference in the proportion (percentage) of infection between vaccine A and vaccine B. Note that a total of 35 children out of 200 suffered from the infection. Suppose that this infection would have occurred in the children due to other factors, irrespective of the efficacy of the vaccine (other factors may include technique, storage, immunity status of the children etc.). Hence we calculate the proportion attacked in the studied population, which is: No. of people attacked / total number of exposed children = 35 (column total) / 200 (grand total) = 0.175, or 17.5%. No. of people not attacked = 165/200 = 0.825 = 82.5%.

Hence the q in this case will be q1:

$$q_1 = (1 - p_1) = 100 - 24.4 = 75.6\% \tag{4.35}$$

Let us calculate the proportion of infection with vaccine B:

$$p_2 = \frac{14}{86} = 0.162 \text{ or } 16.2\%$$

Here the q will be q2:

$$q_2 = (1 - p_2) = 100 - 16.2 = 83.8\% \tag{4.36}$$

We calculate the SE of the difference between two proportions by the following formula:

$$SE = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} = \sqrt{\frac{24.4 \times 75.6}{90} + \frac{16.2 \times 83.8}{86}} = \sqrt{20.49 + 15.78} = \sqrt{36.27} \approx 6 \tag{4.37}$$

The standard error of the difference is 6, whereas the observed difference between the two proportions is (24.4 − 16.2) = 8.2. As we know, the difference should be at least double the SE to be called significant. Since the observed difference is less than double the SE, we conclude that the difference in the attack rates of the two vaccines is not due to a true difference in the quality of the vaccines but to chance alone. We do not reject the null hypothesis (H0), and we say that the difference is statistically insignificant.
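The vaccine comparison can be repeated in Python without rounding the intermediate percentages. The function name `se_difference` is an assumption made for this sketch; the counts are those given in the example.

```python
import math


def se_difference(cases_1, n_1, cases_2, n_2):
    """Return both proportions (in percent) and the SE of their difference."""
    p_1 = cases_1 / n_1 * 100
    p_2 = cases_2 / n_2 * 100
    se = math.sqrt(p_1 * (100 - p_1) / n_1 + p_2 * (100 - p_2) / n_2)
    return p_1, p_2, se


# Vaccine A: 22 infected out of 90; vaccine B: 14 infected out of 86.
p_a, p_b, se = se_difference(22, 90, 14, 86)
difference = abs(p_a - p_b)
significant = difference > 2 * se

print(f"p_A = {p_a:.1f}%, p_B = {p_b:.1f}%, SE = {se:.2f}")
print(f"difference = {difference:.1f}, twice the SE = {2 * se:.1f}")
print("statistically significant:", significant)
```

Unrounded arithmetic gives figures marginally different from the hand calculation, but the difference still falls short of twice the standard error.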

Now we construct another table based on the above data, supposing that there is no difference between the two vaccines (H0 true), to see what number of people we would expect in each of the cells a, b, c and d. As there are 200 individuals in total and 35 of them were attacked, we would expect half of the attacked (17.5 children, i.e. 17.5% of 100) in vaccine A (cell a) and half (17.5) in vaccine B (cell c). The formula for the calculation is

Chi Square Test


If we look closely at the tests of significance used so far, we observe that all of them measured the difference between a population mean and a sample mean, between one sample mean and another, between a population proportion and a sample proportion, or between one sample proportion and another (as in the last example), and concluded whether or not there is a significant difference between the two. We can just tell

Table 11-8
Distribution of observed attack rate

Measles vaccine exposed   Attacked   Not attacked   Total
A                         20 (a)     80 (b)         100
B                         15 (c)     85 (d)         100
Total                     35         165            200

Table 11-9
Distribution of expected attack ratio

Vaccine   Attacked   Not attacked   Total
A         17.5       82.5           100
B         17.5       82.5           100
Total     35         165            200

Table 11-10

Cell    (O-E)               (O-E)^2   (O-E)^2/E
a       20 - 17.5 = 2.5     6.25      6.25/17.5 = 0.33
b       80 - 82.5 = -2.5    6.25      6.25/82.5 = 0.07
c       15 - 17.5 = -2.5    6.25      6.25/17.5 = 0.33
d       85 - 82.5 = 2.5     6.25      6.25/82.5 = 0.07
Total                                 0.8

N.B.: O = Observed, E = Expected

Expected Value (EV) = (RT x CT) / GT

significant. A table has been designed by researchers for this purpose (Table 11-11). We read off the value for 1 df (degree of freedom) at the 5% (0.05) level of significance, which is 3.84. We calculate the degrees of freedom (df) for a table using the following formula:

df = (r - 1) x (c - 1)

Here r is the number of rows and c is the number of columns. As a 2 x 2 table has 2 rows and 2 columns, df = (2 - 1) x (2 - 1) = 1 x 1 = 1. Hence the degree of freedom in a 2 x 2 table is one (1). We compare the calculated chi square value of 0.8 with the cut off value of 3.84 and conclude that the difference is too small to be called significant. Hence there is no statistically significant difference in effectiveness between vaccines A and B, and the difference in the results can be attributed to chance alone.

Here RT = row total, CT = column total and GT = grand total. The row total for cell a is 100 and the column total is 35, while GT is 200. Hence the expected value for cell a will be

EV = (100 x 35) / 200 = 17.5

Using the same formula, the expected values for all cells are calculated and the expected 2 x 2 table is computed (Table 11-9). For the null hypothesis to be true there should be no difference between the observed values (Table 11-8) and the expected values (Table 11-9). However, we do find a difference; let us see whether this difference is due to chance alone or is a real difference. We use the strategy set out in Table 11-10. The standardized difference between the two groups works out to 0.8 (commonly known as the chi square value). Now the question is what cut off value should be used to decide that the difference is large enough to be called

Example 2
The result of a field trial of vaccines A and B given for the prevention of mumps is shown in Table 11-12. From the table it appears that vaccine B is better than vaccine A. We have to confirm whether the difference is real or due to chance.

Table 11-11

D.F.   0.5    0.1     0.05    0.02    0.01    0.005   0.001
1      0.45   2.71    3.84    5.41    6.64    7.88    10.83
2      1.39   4.61    5.99    7.82    9.21    10.60   13.82
3      2.37   6.25    7.82    9.84    11.34   12.84   16.27
4      3.36   7.78    9.49    11.67   13.28   14.86   18.47
5      4.35   9.24    11.07   13.39   15.09   16.75   20.52
6      5.35   10.65   12.59   15.03   16.81   18.55   22.46
7      6.35   12.02   14.07   16.62   18.48   20.28   24.32
8      7.34   13.36   15.51   18.17   20.09   21.96   26.13
9      8.34   14.68   16.92   19.68   21.67   23.59   27.88
10     9.34   15.99   18.31   21.16   23.21   25.19   29.59

Here we will use both observed and expected value in Table 11-13.

Table 11-12
Result of the Field Trial: Mumps Vaccine A and Vaccine B

Vaccine   No. attacked   No. not attacked   No. of vaccinated   Attack rate
A         20             80                 100                 20%
B         30             270                300                 10%
Total     50             350                400

Source: JE Park, K. Park, Text Book of Preventive and Social Medicine

Table 11-13
Observed (O) and Expected (E) values

Vaccine   Attacked             Not attacked
A         O = 20, E = 12.5     O = 80, E = 87.5
B         O = 30, E = 37.5     O = 270, E = 262.5

Table 11-14

Cell   O-E                    (O-E)^2   (O-E)^2/E
a      20 - 12.5 = +7.5       56.25     56.25/12.5 = 4.5
b      80 - 87.5 = -7.5       56.25     56.25/87.5 = 0.64
c      30 - 37.5 = -7.5       56.25     56.25/37.5 = 1.5
d      270 - 262.5 = +7.5     56.25     56.25/262.5 = 0.2

Calculated value of χ² = 6.85

Now compute χ² (see Table 11-14). The calculated value of χ² is 6.85. As the critical value of 3.84 (1 df, 5% level of significance) is less than the calculated value of 6.85, H0 is rejected. This suggests that there is a significant difference in the efficacy of vaccines A and B.
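The chi square procedure used in both vaccine examples (expected value = row total x column total / grand total, then the sum of (O-E)²/E over the four cells) can be sketched in Python as follows. This is an illustrative sketch only; the function name `chi_square_2x2` is hypothetical, and no continuity correction is applied, which matches the hand calculation in the text. Unrounded arithmetic gives about 0.87 for the measles data and about 6.86 for the mumps data, slightly different from the rounded hand totals of 0.8 and 6.85, with the same conclusions.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi square for a 2 x 2 table laid out as
         row 1 (vaccine A): a = attacked, b = not attacked
         row 2 (vaccine B): c = attacked, d = not attacked
    """
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    grand_total = a + b + c + d

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected value = RT x CT / GT
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

CRITICAL_1DF_5PCT = 3.84  # cut off value for 1 df at the 5% level (Table 11-11)

measles = chi_square_2x2(20, 80, 15, 85)   # Table 11-8
mumps = chi_square_2x2(20, 80, 30, 270)    # Table 11-12

for name, value in (("measles", measles), ("mumps", mumps)):
    decision = "reject H0" if value > CRITICAL_1DF_5PCT else "do not reject H0"
    print(f"{name}: chi square = {value:.2f}, {decision}")
```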

The critical value of the t test is 2.0 for a sample size of 49, i.e. at n - 1 = 49 - 1 = 48 degrees of freedom (see Table 11-15). Since the magnitude of the calculated value (2.8) exceeds the critical value, it falls in the rejection region; we conclude that there is a significant difference between the sample mean and the population mean, and the p value is less than 0.05.

Student's t Test
If we compare two proportions we use the chi square test. However, if we have to compare two means (continuous or numerical data) we have to use a different test of significance, commonly known as Student's t test. As we are looking at the difference between two means, the first entry in the formula for the t test will be

$$t \propto (\bar{X} - \mu) \qquad (4.38)$$

Tests of Statistical Significance: Correlation and Regression Analyzing Associations that Involve Interval/Ratio (i.e. Continuous) Data
A. Overview of useful statistical procedures. Earlier we discussed statistical procedures that allow clinical investigators to analyze associations involving frequency data.
B. Both types of procedures enable the clinical investigator to determine
1. The statistical significance of an observed association
2. The magnitude of the association
3. The amount of variation in the response variable that is attributable to a putative risk factor.
C. Correlation analysis and regression analysis are two procedures used to analyze associations involving continuous, or interval/ratio, data [10].
1. Correlation analysis measures the strength of the association between two study variables. The term correlation analysis, as used in this chapter, refers to Pearson's product moment correlation coefficient (also known as Pearson's r).
2. Regression analysis derives a prediction equation for estimating the value of one variable given the value of the second.
3. Correlation and regression analyses function similarly for all study designs.
D. Data representation and regression

As the importance of the difference between two means depends on the standard error, the second entry will be the standard error. The larger the S.E., the less important the difference. Hence t is inversely related to the value of the standard error:

$$t \propto \frac{1}{SE} \qquad (4.39)$$

So the formula will be

$$t = \frac{\bar{X} - \mu}{SE} \quad \text{or} \quad t = \frac{\bar{X} - \mu}{SD / \sqrt{n}} \qquad (4.40)$$

Here $\bar{X}$ is the sample mean, $\mu$ is the population mean, SE is the standard error, SD is the standard deviation and n is the sample size.

Example
Suppose a population has a mean Hb of μ = 12 gm. A sample of 49 is taken from the population, which has a mean $\bar{X}$ = 11.2 gm and a standard deviation of 2 gm. Using formula 4.40, assess at the 95% confidence level how likely it is to obtain a sample mean of 11.2 gm.

$$t = \frac{\bar{X} - \mu}{SD / \sqrt{n}} = \frac{11.2 - 12}{2 / \sqrt{49}} = \frac{-0.8}{0.286} = -2.8$$
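The calculation with formula 4.40 can be reproduced with a short Python sketch. It is illustrative only: the function name `one_sample_t` is hypothetical, and the standard deviation is assumed to be 2 gm, the value consistent with the standard error of about 0.28 and the result of -2.8 reported in the example. The critical value of 2.0 is the approximate t.975 entry of Table 11-15 near 48 degrees of freedom.

```python
import math

def one_sample_t(sample_mean, population_mean, sd, n):
    """t = (sample mean - population mean) / (SD / sqrt(n)), formula 4.40."""
    standard_error = sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error

# Assumed inputs: mean Hb 12 gm in the population, sample of 49 with
# mean 11.2 gm and SD 2 gm (SD is an assumption, see the note above).
t = one_sample_t(11.2, 12.0, 2.0, 49)

critical_value = 2.0          # approximate two-sided 5% cut off near 48 df
degrees_of_freedom = 49 - 1
reject_h0 = abs(t) > critical_value

print(f"t = {t:.1f} with {degrees_of_freedom} df, reject H0: {reject_h0}")
```

The sign of t only shows that the sample mean lies below the population mean; the decision is based on its magnitude.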

1. The data used in correlation or regression analyses consist of pairs of measurements made on the same unit of observation (most often, the same study subject). Each member of the pair corresponds to one of the two study variables. For example, in a study of the relationship between hypertension and blood cholesterol levels, systolic blood pressure and serum cholesterol value form the pair of measurements to be assessed for each study subject and to be represented by the two study variables.
2. Notation. The pairs are denoted symbolically (X, Y). By convention, X typically represents the independent variable, while Y represents the dependent, or response, variable.
3. Variation among study types
a. In epidemiological studies, the independent variable X is often a suspected risk factor (e.g. a low fiber diet) and the dependent variable Y is the occurrence of disease or other health-related outcome (e.g. the occurrence of colorectal cancer).
b. In experimental studies, values of the independent variable are fixed by the investigator rather than determined by nature. For example, in a study of the efficacy of a new antiviral drug in preventing the recurrence of herpes simplex infections, the investigator selects the dosages (the independent variable) to be administered. Thus, dosage is a fixed, not a random, variable.
E. Correlation vs. Regression
1. Correlation analysis is restricted to studies in which both variables are random, or determined by nature (e.g. epidemiological studies).
2. Regression analysis may be used in either of the following situations:
a. when one variable is fixed and one is random (the classic regression model)
b. when both variables are random (the correlation model)
3. When both variables are random, the terms dependent variable and independent variable are irrelevant.
4. Tabular representation of data pairs prior to correlation or regression analysis is shown in Table 11-16.

Summary statistics for Table 11-16: $\bar{X}$ = 315, $\bar{Y}$ = 31.5, Sx = 123.15, Sy = 15.995, ΣXi = 3150, ΣYi = 315, ΣXi² = 1128750, ΣXiYi = 116375

Table 11-15
Table for t Test

df     t.90     t.95     t.975    t.99     t.995
1      3.078    6.3138   12.71    31.821   63.657
2      1.886    2.92     4.303    6.965    9.9248
3      1.638    2.3534   3.183    4.541    5.8409
4      1.533    2.1318   2.776    3.747    4.6041
5      1.476    2.015    2.571    3.365    4.0321
6      1.44     1.9432   2.447    3.143    3.7074
7      1.415    1.8946   2.365    2.998    3.4995
8      1.397    1.8595   2.306    2.896    3.3554
9      1.383    1.8331   2.262    2.821    3.2498
10     1.372    1.8125   2.228    2.764    3.1693
11     1.363    1.7959   2.201    2.718    3.1058
12     1.356    1.7823   2.179    2.681    3.0545
13     1.35     1.7709   2.16     2.65     3.0123
14     1.345    1.7613   2.145    2.624    2.9768
15     1.341    1.753    2.132    2.602    2.9467
16     1.337    1.7459   2.12     2.583    2.9208
17     1.333    1.7396   2.11     2.567    2.8982
18     1.33     1.7341   2.101    2.552    2.8784
19     1.328    1.7291   2.093    2.539    2.8609
20     1.325    1.7247   2.086    2.528    2.8453
21     1.323    1.7207   2.08     2.518    2.8314
22     1.321    1.7171   2.074    2.508    2.8188
23     1.319    1.7139   2.069    2.5      2.8073
24     1.318    1.7109   2.064    2.492    2.7969
25     1.316    1.7081   2.06     2.485    2.7874
26     1.315    1.7056   2.056    2.479    2.7787
27     1.314    1.7033   2.052    2.473    2.7707
28     1.313    1.7011   2.048    2.467    2.7633
29     1.311    1.6991   2.045    2.462    2.7564
30     1.31     1.6973   2.042    2.457    2.75
35     1.3062   1.6896   2.03     2.438    2.7239
45     1.3007   1.6794   2.014    2.412    2.6896
50     1.2987   1.6759   2.009    2.403    2.6778
60     1.2959   1.6707   2        2.39     2.6603
70     1.2938   1.6669   1.995    2.381    2.648
80     1.2922   1.6641   1.99     2.374    2.6388
90     1.291    1.662    1.987    2.368    2.6316
100    1.2901   1.6602   1.984    2.364    2.626
120    1.2887   1.6577   1.98     2.358    2.6175
140    1.2876   1.6558   1.977    2.353    2.6114
160    1.2869   1.6545   1.975    2.35     2.607
180    1.2863   1.6534   1.973    2.347    2.6035
200    1.2858   1.6525   1.972    2.345    2.6006
∞      1.282    1.645    1.96     2.326    2.576

Adapted from Lentner C (ed): Geigy Scientific Tables, 8th ed, Volume 2. Basle, Switzerland, Ciba-Geigy, 1982, pp 30-33

ΣYi² = 12225 (summary statistic for Table 11-16)

(5) Graphic representation: the scatter diagram. Quantitative data obtained in a study of the association between two variables can be displayed graphically in a scatter diagram. Each pair (X, Y) of values (see Table 11-16) is represented in the scatter diagram by a dot located at the point (X, Y).

Table 11-16
Psychosis Intensity scores and Plasma amphetamine levels for 10 Chronic Amphetamine abusers

Patient   Psychosis Intensity Score (Y)   Plasma amphetamine mg/ml (X)
1         10                              150
2         30                              300
3         20                              250
4         15                              150
5         45                              450
6         35                              400
7         50                              425
8         15                              200
9         40                              350
10        55                              475

Choosing the axis

When both variables are random variables, the choice of which axis is labelled X and which Y is arbitrary.
1. When regression analysis is to be used to predict the value of one variable from the value of the other, the variable to be predicted (i.e. the dependent variable) is plotted on the Y axis.
a. Example: In a study of the relationship between plasma amphetamine levels and amphetamine induced psychosis, 10 chronic amphetamine abusers underwent psychiatric evaluation and were assigned a psychosis intensity score. At the same time, plasma amphetamine levels in these patients were determined. The results are shown in Table 11-16. Figure 11-16 is the scatter diagram of these data.
b. The relationship between X and Y may be described by a straight line or by a more complex curvilinear relationship. Alternatively, the scatter diagram may show that the two variables are unrelated.
c. Linear relationships may be either positive (i.e. as the value of one variable increases, the value of the other variable increases as well) or negative (i.e. as the value of one variable increases, the value of the other decreases). For example, the relationship between plasma amphetamine level and psychosis intensity is a positive linear relationship (see Figure 11-16).
2. The relationship between X and Y may be nonlinear, or curvilinear. For example, the relationship between age and death rate is nonlinear: during the neonatal period the death rate is high; it then decreases, becomes relatively stable through middle age, and rises again during old age (Figure 11-17). When X and Y are unrelated, the data pairs are randomly distributed (Figure 11-18). For example, no relationship, either linear or nonlinear, exists between foot size and IQ.

II. Correlation Analysis


Assessing the strength of the association between two variables. The correlation coefficient, denoted symbolically as r, defines both the strength and the direction of the linear relationship between two variables.

Characteristics of the correlation coefficient

The correlation coefficient is an index number between -1 and +1. When r = -1, the two variables have a perfect negative linear relationship. In this case all points in the scatter diagram fall exactly on a straight line that slopes downward from left

Figure 11-16
Scatter diagram relating psychosis intensity to plasma amphetamine levels in 10 chronic amphetamine abusers. (Y axis: Psychosis Intensity Score, 0 to 60; X axis: Plasma amphetamine levels, 0 to 500.)

Figure 11-17
Scatter diagram illustrating a nonlinear relationship between two study variables. (Axes: Variable 1 and Variable 2.)

to right, e.g. vehicle weight and horse power. When r = +1, the study variables have a perfect positive linear relationship. In this case, all points in the scatter diagram fall exactly on a straight line that slopes upward from left to right. When r = 0, there is no linear relationship between the study variables. The relationship may be nonlinear (see Figure 11-17). Alternatively, the study variables may be unrelated; for example, in Figure 11-18 the relationship between the two variables is described by a line whose slope is equal to 0, that is, a change in one variable has no effect on the other. The better the points on the scatter diagram approximate a straight line, the greater the magnitude of r.

The correlation coefficient calculated for a sample drawn from a population of interest, r, is an estimate of the population correlation coefficient, denoted symbolically as ρ. The population correlation coefficient is a feature of the linear association between the study variables for all members of the population. In other words, r is the statistic that estimates the population parameter ρ. The correlation coefficient, r, is defined by the formula

Figure 11-18
Scatter diagram from a study of two unrelated variables. (Axes: Variable 1 and Variable 2.)

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1)\, S_x S_y} \qquad (4.41)$$

where $\bar{X}$ is the mean of the observed values of the X variable, $\bar{Y}$ is the mean of the observed values of the Y variable, $S_x$ is the standard deviation of the values of the X variable, $S_y$ is the standard deviation of the values of the Y variable, and n is the number of pairs of measurements. Thus
a. The correlation coefficient is a dimensionless value (the units of measurement in the numerator and denominator cancel). Thus, r is unaffected by changes in the units, provided the measurements are made on the same subjects. For example, if the units of weight and height are changed from pounds and inches to grams and centimetres, r remains the same.
b. When the relationship between X and Y is positive, values of X above their mean $\bar{X}$ tend to be paired with above average values of Y, while values of X below $\bar{X}$ tend to be paired with below average values of Y. Therefore, the product $(X - \bar{X})(Y - \bar{Y})$ is more often positive than negative, leading to a positive value for r (note that $S_x$ and $S_y$ are always positive).
c. When the relationship between the X and Y variables is negative, values of X above their mean $\bar{X}$ tend to be paired with values of Y below their mean $\bar{Y}$. Therefore, the product $(X - \bar{X})(Y - \bar{Y})$ is more often negative than positive, leading to a negative value of r.

For calculation purposes, formula 4.41 may be rewritten

$$r = \frac{S_{xy}}{\sqrt{(S_{xx})(S_{yy})}} \qquad (4.42)$$

Using the data from Table 11-16 and substituting values in the following computational formula for $S_{xy}$:

$$S_{xy} = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n} \qquad (4.43)$$

$$= 116375 - \frac{(3150)(315)}{10} = 17150$$

For $S_{xx}$ the computational formula is

$$S_{xx} = \sum X_i^2 - \frac{(\sum X_i)^2}{n} \qquad (4.44)$$

Substituting the values in formula 4.44

$$= 1128750 - \frac{(3150)^2}{10} = 136500$$

Similarly for $S_{yy}$

$$S_{yy} = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n} \qquad (4.45)$$

Substituting the values in formula 4.45

$$= 12225 - \frac{(315)^2}{10} = 2302.5$$

Substituting these values in formula 4.42

$$r = \frac{17150}{\sqrt{(136500)(2302.5)}} = .97$$
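The computational formulas 4.42 to 4.45 can be checked against the raw data of Table 11-16 with a few lines of Python. This is a minimal sketch, not part of the original text; the variable names are illustrative.

```python
import math

# Data pairs from Table 11-16
x = [150, 300, 250, 150, 450, 400, 425, 200, 350, 475]  # plasma amphetamine (X)
y = [10, 30, 20, 15, 45, 35, 50, 15, 40, 55]            # psychosis intensity (Y)
n = len(x)

sum_x = sum(x)
sum_y = sum(y)

# Formulas 4.43, 4.44 and 4.45
s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
s_xx = sum(a * a for a in x) - sum_x ** 2 / n
s_yy = sum(b * b for b in y) - sum_y ** 2 / n

# Formula 4.42
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"Sxy = {s_xy}, Sxx = {s_xx}, Syy = {s_yy}, r = {r:.2f}")
```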

Thus, the correlation between psychosis intensity and plasma amphetamine level is r = .97, a nearly perfect positive correlation.

Assessing the statistical significance of an association

Using an appropriate statistical test, it is possible to address the question: Is there a statistically significant linear relationship between two study variables in the population from which the samples were selected? As with other statistical tests of hypothesis, the answer to this question revolves around the most likely explanation for the disparity between the sample estimate (i.e. the sample correlation coefficient r) and its corresponding population parameter (i.e. the population correlation coefficient ρ).

Example
The psychiatrist conducting the study of the relationship between amphetamine induced psychosis and plasma amphetamine levels wishes to determine whether a statistically significant association between these variables exists in the population from which the sample was selected. That is, he wishes to know whether it is likely that a correlation coefficient of the magnitude r = .97 or greater would be obtained from a sample of 10 subjects by random chance when the population correlation coefficient (ρ) is equal to 0. The investigator follows the five steps of statistical hypothesis testing.

State the hypothesis: The null hypothesis (H0) is: plasma amphetamine levels are not linearly related to psychosis intensity in the population of patients from which the 10 study participants were selected (i.e. H0: ρ = 0). The alternative hypothesis (HA) is: there is a linear relationship between plasma amphetamine levels and psychosis intensity in the population from which the study subjects were selected (i.e. HA: ρ ≠ 0).

a. Select a sample and collect data.
b. Calculate the test statistic. The test statistic measures the disparity between the observed sample correlation coefficient (r = .97) and the value assumed for the population correlation coefficient by H0 (ρ = 0). The test statistic for this example is the sample correlation coefficient r = .97. (Alternatively, the test statistic may be calculated and statistical significance assessed using a t test.)
c. Evaluate the evidence against H0.
Frequency distribution of the test statistic. The frequency distribution of all possible values of the test statistic when H0 is true is provided by Table 11-17. The number of degrees of freedom (df) associated with the frequency distribution of the test statistic is df = n - 2, where n equals the number of pairs of observations. Here, df = 10 - 2 = 8.
Calculating the p-value. The p-value is the probability of obtaining the calculated value of the test statistic, or a more extreme one, by random chance when H0 is true. In Table 11-17, the value for α = .01, df = 8 is .7646. Since r = .97 falls to the right of .7646, p < .01. Hence in this example, the probability of obtaining a sample correlation coefficient equal to or greater than r = .97 from a population with a correlation coefficient ρ = 0 is less than 1%.
Decision rule. Prior to data collection, the psychiatrist chooses a 5% level of significance (α = .05). Based on the level of significance, he derives the following decision rule: if p < .05, reject H0.
d. State the conclusion. Because in this example the calculated value of r was .97, the following can be concluded:
Statistical conclusion. Since p < .05, H0 is rejected at the α = .05 level of significance.

Table 11-17
Critical Values of the Correlation Coefficient for Different Levels of Significance

df*    .05       .01
1      .996917   .9998766
2      .95000    .990000
3      .8783     .95873
4      .8114     .91720
5      .7545     .8745
6      .7067     .8343
7      .6664     .7977
8      .6319     .7646
9      .6021     .7348
10     .5760     .7079
11     .5529     .6835
12     .5324     .6614
13     .5139     .6411
14     .4973     .6226
15     .4821     .6055
16     .4683     .5897
17     .4555     .5751
18     .4438     .5614
19     .4329     .5487
20     .4227     .5368
25     .3809     .4869
30     .3494     .4487
35     .3246     .4182
40     .3044     .3932
45     .2875     .3721
50     .2732     .3541
60     .2500     .3248
70     .2319     .3017
80     .2172     .2830
90     .2050     .2673
100    .1946     .2540

* The degrees of freedom (df) = (the number of pairs in the sample - 2). Reprinted from Fisher RA: Statistical Methods for Research Workers, 14th ed. New York, Hafner Press, 1970, p 209.
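The table lookup used in the example can be sketched in Python. The critical values are copied from Table 11-17 for df = 8. The helper `t_from_r` is an illustration of the alternative t test mentioned in the text, using the standard relationship t = r·sqrt((n - 2)/(1 - r²)) with n - 2 degrees of freedom; that formula is not given in this chapter and is included here as an assumption about which t test is meant.

```python
import math

# Critical values of r for df = 8, copied from Table 11-17
CRITICAL_R_DF8 = {0.05: 0.6319, 0.01: 0.7646}

def t_from_r(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom
    (standard formula, assumed here; not stated in the chapter)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = 0.97, 10
df = n - 2

reject_at_5pct = abs(r) > CRITICAL_R_DF8[0.05]
reject_at_1pct = abs(r) > CRITICAL_R_DF8[0.01]
t = t_from_r(r, n)

# Hypothetical weaker correlation with the same sample size
hypothetical_rejected = abs(0.48) > CRITICAL_R_DF8[0.05]

print(f"df = {df}, reject at 5%: {reject_at_5pct}, reject at 1%: {reject_at_1pct}")
print(f"equivalent t statistic = {t:.1f}")
print(f"r = .48 rejected at 5%: {hypothetical_rejected}")
```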

Clinical interpretation. There is a statistically significant association between plasma amphetamine level and psychosis intensity in the population of patients from which the 10 study subjects were selected (p < .01).

Chance of error. When H0 is rejected, the chance of error (the Type I error rate) is given by the p-value or by the level of significance, alpha. In this case, the chance of error associated with concluding that the population correlation coefficient ρ is not equal to 0 (i.e. the chance that a sample correlation coefficient equal to or greater than .97 would be obtained by random chance from a population whose ρ = 0) is less than 1% (p < .01).

When not to reject H0: a hypothetical situation. If, however, the calculated value of r had been .48 (i.e. p > .05), the following could be concluded.

Statistical conclusion. In this instance, since p > α, H0 is not rejected at the α = .05 level of significance.

Clinical interpretation. The results are not statistically significant (p > .05); the data do not support the existence of a significant linear association between plasma amphetamine levels and psychosis intensity at the 5% level of significance in the study population.

Chance of error. When H0 is not rejected, the chance of error is β, the Type II error rate.

Power of the test. The power of the statistical test is defined as the probability that the test will detect a statistically significant association between the two study variables, given that such an association actually exists.
a. The value of power depends on the effect size, the sample size, and the chosen level of significance.
b. A failure to reject H0 may be due to low power of the statistical test, as well as to the absence of an association between the study variables. When power is low, the statistical test may simply be too weak to detect an association of the specified magnitude in the population from which the study sample was drawn.

c. For large samples, power is typically high and small deviations of ρ from 0 (i.e. small effect sizes) can be detected by the statistical test. Therefore, an estimate of the magnitude of the population correlation coefficient ρ is an important adjunct to the demonstration of statistical significance. This estimate is provided by r, the sample correlation coefficient, and r², the coefficient of determination.

Coefficient of Determination (R2)


Definition. The coefficient of determination, r², measures the proportion of the variation in one variable that can be attributed to, or explained by, variation in the second variable [8]. Thus r² is the proportion of the variance in one variable that can be explained by its linear relationship to the other. The coefficient of determination, therefore, is the counterpart of attributable risk. While r² defines the magnitude of the association, it does not define the direction. The sign of r indicates whether the association is positive or negative.

Characteristics of r². When r² = 0 (r, of course, also equals 0), none of the variation in Y can be attributed to changes in X. When r² = 1, all of the variation in Y is attributable to its linear relationship with X.

Graphic representation. Figure 11-19 illustrates a perfect positive correlation (i.e. r = +1 and r² = 1 or 100%). The linear relationship between X and Y accounts for all of the observed variation in Y. There is no variation in the Y values for a fixed value of X because all of the points representing the data pairs fall on a straight line. In Figure 11-20, there is variation among the Y values for fixed values of X. This variation of Y may be attributed in part to its relationship with X. Other factors including systematic

Figure 11-19
Graphic representation of r². The linear relationship between X and Y accounts for all of the observed variation in Y and r² = 1. (Axes: Variable 1 and Variable 2.)

Figure 11-20
The variation in Y is only partially attributable to the linear relationship between X and Y; this proportion of the total variation is described by r², which lies between 0 and 1. (Axes: Variable 1 and Variable 2.)

variation resulting from the relationship between Y and other unknown variables, and random subject-to-subject variation, account for the remainder of the variation. By specifying the proportion of the total variation in Y that is attributable to its linear relationship with X, r² provides a mathematical tool for separating these sources of variation. When r is statistically significant, r² is as well; it may then be concluded that the proportion of variation in Y that is attributable to X differs significantly from 0 in the population from which the sample was drawn.

Example
In the study of the association between amphetamine induced psychosis and plasma amphetamine levels, r = .97 and r² = (.97)² = .94. That is, 94% of the variation in psychosis intensity can be attributed to variations in plasma amphetamine level.

The proper interpretation of r. Correlation does not imply causality. The existence of a statistically significant correlation between study variables does not prove that a cause-and-effect relationship exists between them. A statistically significant correlation between two study variables does not imply that the association is clinically important. Statistical significance merely indicates that the calculated value of r is unlikely to have resulted from random chance when the population correlation coefficient ρ = 0. When the sample size is large, the correlation may achieve statistical significance even though the actual deviation of ρ from 0 is small. The size of the p-value indicates the likelihood that an association exists; it does not specify the magnitude of that association. The value of r (or, preferably, of r²) is the best indicator of the magnitude of the association in the population.

Failure to demonstrate the statistical significance of a given value of r may be due to the absence of a linear relationship between X and Y (i.e. H0 is true and the population correlation coefficient ρ = 0) or due to low power of the statistical test. Nonsensical or spurious correlations may be obtained when average or aggregate data for groups of subjects are substituted for pairs of measurements on individual subjects (known as the ecological fallacy).

Example
A study examining the relationship between aggregate mortality from coronary artery disease in 18 countries and the average wine consumption in each country reported an unexpectedly strong negative correlation between the two variables. Correlations derived from such aggregate studies may disappear when data for individual subjects are analyzed.

III. Regression Analysis


The goal of regression analysis is to derive a linear equation that best fits a set of data pairs (X, Y) represented as points on a scatter diagram. The equation can be used to predict values of the response variable (Y) for given values of the independent variable (X).

Derivation of the regression line

Form of the equation

The general equation for the sample regression line is

$$\hat{Y} = b_0 + b_1 X$$

The general equation for the corresponding population regression line is

$$Y = \beta_0 + \beta_1 X$$

The slope of the sample regression line (b1) is the change in the average value of Y for every one unit change in the value of X. The Y intercept of the sample regression line (b0) is the Y value corresponding to X = 0; it is the point where the regression line crosses the Y axis. The sample regression line estimates the population regression line.

Graphic representation. Figure 11-21 depicts a regression line for a hypothetical group of four data points. No single line passes through all four points simultaneously, but the regression line represents the best fit to the data.

Figure 11-21
Graphic representation of the line that best fits four hypothetical data points (x1, y1), (x2, y2), (x3, y3) and (x4, y4), with the fitted line Ŷ = b0 + b1X.
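A best-fitting line of this form can be computed directly from data pairs. The Python sketch below is illustrative only: it uses the ten pairs of Table 11-16, takes the slope as Sxy/Sxx and the intercept as the mean of Y minus the slope times the mean of X, and also reports the residual sum of squares (SSE = Syy - b1·Sxy) and the standard error of the estimate, sqrt(SSE/(n - 2)). The function name `predict` is hypothetical.

```python
import math

# Data pairs from Table 11-16
x = [150, 300, 250, 150, 450, 400, 425, 200, 350, 475]  # plasma amphetamine (X)
y = [10, 30, 20, 15, 45, 35, 50, 15, 40, 55]            # psychosis intensity (Y)
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products about the means
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
s_yy = sum((b - mean_y) ** 2 for b in y)

b1 = s_xy / s_xx             # slope
b0 = mean_y - b1 * mean_x    # Y intercept

sse = s_yy - b1 * s_xy               # residual sum of squares
se_estimate = math.sqrt(sse / (n - 2))

def predict(x_value):
    """Predicted psychosis intensity for a given plasma amphetamine level."""
    return b0 + b1 * x_value

print(f"b1 = {b1:.3f}, b0 = {b0:.2f}")
print(f"SSE = {sse:.3f}, standard error of the estimate = {se_estimate:.1f}")
print(f"predicted Y at X = 300: {predict(300):.1f}")
```

Note that the intercept comes out negative for these data, so the fitted line should only be used within the observed range of X.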

The statistical technique for finding the line that best fits a particular data set is known as the method of least squares. The best-fitting line is defined as the one for which the sum of the squared distances of all of the data points from the line (known as the residual sum of squares) is minimized. The vertical deviation of any point (Xi, Yi) from the regression line (the estimated error or residual) is defined as

$$d_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i)$$

where $Y_i$ denotes the observed value of Y at the given value of X and $\hat{Y}_i$ represents the corresponding value of Y derived from the regression equation. Therefore, the sum of the squared distances of each of the points from the line is

$$\sum d_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - b_0 - b_1 X_i)^2$$

The values of b0 (Y intercept) and b1 (slope) that describe the line with the smallest residual sum of squares (i.e. the line for which $\sum d_i^2$ is minimized) are obtained using the principles of calculus.

Calculating the slope and intercept of the regression line

The formula for the slope of the regression line is

$$b_1 = \frac{S_{xy}}{S_{xx}}$$

The formula for the Y intercept of the regression line is

$$b_0 = \bar{Y} - b_1 \bar{X}$$

where $\bar{X}$ represents the mean of the X values and $\bar{Y}$ the mean of the Y values.

Example
The psychiatrist conducting the investigation of the relationship between plasma amphetamine levels and amphetamine-induced psychosis also examines these data using regression analysis. Using the values of $S_{xy}$ and $S_{xx}$ computed during correlation analysis, the slope and Y intercept of the regression line are calculated as

$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{17150}{136500} = .125641 \approx .126$$

$$b_0 = \bar{Y} - b_1 \bar{X} = 31.5 - (.125641)(315) = -8.08$$

The regression equation relating plasma amphetamine level (X) to psychosis intensity (Y) is therefore

$$\hat{Y} = -8.08 + .126 X$$

Describing the variation around the regression line

The standard deviation of the data points about the regression line is called the standard error of the estimate, denoted symbolically $S_{y \cdot x}$. The formula for computing the standard error of the estimate is

$$S_{y \cdot x} = \sqrt{\frac{SSE}{n - 2}}$$

where n is the number of (X, Y) pairs and SSE is the error or residual sum of squares, defined as

$$SSE = S_{yy} - b_1 S_{xy}$$

where b1 is the slope of the regression line.

Example
For the regression line derived for Table 11-16

$$SSE = 2302.5 - (.125641)(17150) = 147.756$$

$$S_{y \cdot x} = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{147.756}{8}} = 4.3$$

The standard error of the estimate, $S_{y \cdot x}$ = 4.3, is the SD of the observed psychosis intensity scores (Y) around the regression line.

Estimation of Population
For comparison of health conditions in different places at the same time, or at different times in the same place, relative figures (i.e. rates and ratios) are used as an index. For this purpose the population of the community for which a particular rate is to be determined forms the base reference. It is therefore essential first to know the total number of persons in the community. The number of persons in any community can be determined by counting, which is called a census. The first complete census on modern lines was taken in Sweden in 1749. Other countries followed. In Indo-Pakistan the first census was taken between 1867 and 1872 and was repeated in 1881. Thereafter, it was carried out every 10th year. Pakistan came into existence in 1947; the first census in Pakistan was taken in 1951, the second in 1961, the third in 1972, the fourth in 1981 and the fifth in 1998. Censuses are taken (1) by enumeration, (2) by questionnaire or (3) by a combination of the first two methods. The first method is used in Pakistan and the USA, while the questionnaire method is adopted in England. In enumeration, the enumerator visits every home in the area and elicits the desired information from the head or the responsible members of the family. The desired information comprises name, age, sex, marital status, caste, ethnicity, educational characteristics, monthly income, occupation, urban or rural residence, etc. In the questionnaire method, the questionnaire form is handed over to the family or household, who are required to fill in the form.


Example
Estimate the mid-year population for the year 1990 from the following information.
Census population for town X, first of April 1971 = 500,000
Census population for the same town, first of April 1981 = 700,000
From this information the mid-year estimated population for 1990 can be calculated as follows. The increase in population from 1971 to 1981 is the difference between the two census figures, i.e. 700,000 - 500,000 = 200,000. This increase is for 10 years, so it is divided by 10; the annual increase comes to 200,000/10 = 20,000. The interval from 1 April 1981 to 1 July 1990 is 9 years and 3 months = 37/4 years (months converted to years), so the increase in population over this interval will be

$20{,}000 \times \frac{37}{4} = \frac{740{,}000}{4} = 185{,}000$

This increase is added to the 1981 census figure, so the mid-year estimated population for 1990 = 700,000 + 185,000 = 885,000.

Type of Census
A census may be either de facto or de jure. In a de facto census [20] a person is counted at the place where he or she is found at the time of counting, while in a de jure census a person is counted at the place of his or her usual or normal residence. For census years, rates can be calculated on the basis of the census-year population, but for an inter-censal year the population has to be estimated. The following methods are used to estimate the population of any year.

Estimating Population Outside Census


There are three main methods:
• Natural increase method
• Arithmetic progression method
• Geometric progression method

Geometric Progression Method (G.P.)


This method takes into account the increase each year in the number of potential parents, who are themselves responsible for further increase in the population. The procedure is therefore similar to compound interest, whereas the A.P. method is analogous to simple interest. The G.P. method is based on the assumption that the rate of increase of the population remains constant. Besides these three main methods, a number of other techniques are also used. For example, the ratio between inhabited houses and the population, the ratio between voters and the general population, or the ratio between school children and the general population can be used to estimate the population for any year. Note that the mid-year population is calculated for the first of July, i.e. as it existed at the middle of the year.
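The compound-interest analogy can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not a formula given in the text: the constant annual rate r is taken from the two census counts via P2 = P1 (1 + r)^n, and the estimate is carried forward with the same rate. The function name `geometric_estimate` is hypothetical; the figures are the town X census counts from the worked example (500,000 in 1971 and 700,000 in 1981), projected 9.25 years past the 1981 census.

```python
def geometric_estimate(pop_first, pop_last, years_between_censuses, years_ahead):
    """Project a population forward assuming a constant annual growth rate."""
    # Annual rate implied by the two censuses: pop_last = pop_first * (1 + r) ** n
    rate = (pop_last / pop_first) ** (1 / years_between_censuses) - 1
    # Compound the last census figure forward by the same rate
    estimate = pop_last * (1 + rate) ** years_ahead
    return estimate, rate

# 1 April 1981 to 1 July 1990 is taken as 9.25 years.
estimate, rate = geometric_estimate(500_000, 700_000, 10, 9.25)
print(f"annual rate = {rate:.4f}, mid-1990 estimate = {estimate:,.0f}")
```

Under these assumptions the implied rate is about 3.4% a year and the estimate is a little over 950,000, which is higher than a straight-line projection from the same two censuses because each year's increase is itself compounded.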

Natural Increase Method


According to this method, the population in any year is equal to the previous census population, plus the births recorded during the period and the persons who have migrated into the community during the period, minus the deaths during the same period and the number of persons who have left the area. It is the simplest method and a very reliable one, but it depends on the availability and accuracy of this information. Given the unreliable statistics presently available in our country, this method is not used. Moreover, the method cannot be used for a part of the country for which records of immigration and emigration are not available. This method is used in the UK, where recording is very satisfactory.
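The bookkeeping described above can be written as a one-line balance. The Python sketch below is only an illustration: the function name and every figure in the usage example are hypothetical, chosen to show the arithmetic rather than taken from any register.

```python
def natural_increase_estimate(census_pop, births, deaths, immigrants, emigrants):
    """Previous census count plus additions, minus losses, over the same period."""
    return census_pop + births + immigrants - deaths - emigrants

# Hypothetical figures for one inter-censal period (not real data).
estimate = natural_increase_estimate(
    census_pop=700_000,
    births=21_000,
    deaths=8_000,
    immigrants=3_000,
    emigrants=2_000,
)
print(estimate)
```

The result is only as good as the four registers feeding it, which is why the method is reserved for settings where births, deaths and migration are all recorded reliably.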

Arithmetic Progression Method (A.P.)


The difference between the two previous census figures is divided by the number of years between the two censuses to find the annual increase in population. Assuming that this annual increase remains constant, the population for any year can be estimated. The procedure is illustrated in the worked example on the mid-year population of town X.
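The arithmetic progression steps can be sketched in Python as follows. This is a minimal illustration, assuming census dates that fall on the first of a month so that intervals can be counted in whole months; the function names are hypothetical. The figures are those of town X (500,000 on 1 April 1971 and 700,000 on 1 April 1981), projected to 1 July 1990, an interval of 9 years and 3 months after the second census.

```python
from datetime import date

def years_between(start, end):
    """Interval between two first-of-month dates, in years."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    return months / 12

def arithmetic_estimate(pop_first, date_first, pop_last, date_last, target):
    """Extend the last census by a constant annual increase."""
    annual_increase = (pop_last - pop_first) / years_between(date_first, date_last)
    return pop_last + annual_increase * years_between(date_last, target)

estimate = arithmetic_estimate(
    500_000, date(1971, 4, 1),
    700_000, date(1981, 4, 1),
    date(1990, 7, 1),
)
print(round(estimate))
```

Here the annual increase is 20,000 and the interval is 9.25 years, so the sketch gives 700,000 + 185,000 = 885,000.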

Growth Rate of Population of Pakistan


At the 1901 census, the population of the area that now constitutes Pakistan was 16.6 million. Table 11-18 shows the subsequent increase in population.

Table 11-18
Growth of Population
Year: 1901 1911 1921 1931 1941 1951 1961 1972 1981 1991 1995 1998 2010
Population (millions): 16.6 19.4 21.1 23.6 28.3 33.7 42.9 64.9 83.8 110 115 122 130.58 160.91
Remarks: Estimated. Portion of India in 1947 creating Pakistan. Former East Pakistan separated as Bangladesh. Rural 72%, urban 28%. Estimated. Prior to division of India.

N.B. The figures for former East Pakistan have been excluded throughout.

System of Collection and Registration of Vital Events in Pakistan


The system used in Pakistan for the collection and registration of vital events follows the scheme below.

Registration of Births and Deaths


The registration of births and deaths is carried out throughout the provinces of Pakistan except the Agencies and Tribal Areas of the North-West Frontier Province (NWFP), Balochistan, the trans-frontier tract of Dera Ghazi Khan District and Kalat Division. In Quetta Division the registration of births and deaths is confined to Quetta City and Quetta Cantonment only. There is a uniform system of registration of births and deaths in the areas comprising the Punjab, NWFP and the former Bahawalpur State. Sindh Province, however, has a different system for registering these events. The system in vogue in the areas comprising the provinces of NWFP and Punjab and Bahawalpur State is as follows:

Rural Areas
Particulars of each birth and death occurring in a village are recorded in two books, one for births and another for deaths, kept at the Union Council office. Copies of the entries in the Union Council registers are forwarded every week or fortnight by the police station authorities, through the Superintendent of Police, to the District Health Officer (DHO), Assistant District Health Officer (ADHO) or Civil Surgeon (CS) concerned. The DHO/ADHO/CS transmits the copies of the birth and death registers of all the rural circles in the district to the Director Health Services concerned, where the compilation of these data is undertaken.

Urban Areas
When a birth or death occurs in any household, the head of the family is required to report it, generally within a week of occurrence, to the local body, the municipal office or the Cantonment Board. If for any reason he is unable to do so himself, the report is made by any adult member of his family. Birth and death registers are kept in the Municipal Registry Office, and copies of the entries in the registers are forwarded each week to the DHO/ADHO/CS, who transmits them to the Director Health Services of the region concerned for compilation.

Births and Deaths in Medical Institutions


Hospitals and dispensaries supply information on the births and deaths taking place in these institutions to the local health authorities, and this information ultimately reaches the DHO/ADHO/CS of the district concerned. In urban areas birth and death registration is done by local bodies, while in rural areas it is done by the village headman or Union Council. A scheme has been considered under which a uniform system of registration of births and deaths would be adopted in the whole of Pakistan and registrars would be appointed for this purpose, directly under the control of the Provincial Health Department. It has been felt that the Police and Revenue authorities cannot take proper interest in the registration of these events, and as such a large number of births and deaths escape registration. The extent of such under-registration is very large in certain cases.

Health System Reporting


For the development of an effective information system, the following instruments have been finalized to collect data from first level care facilities (FLCF). They were approved by the Federal Ministry of Health and by the Provincial Health Departments during two National Health Management Information System (HMIS) Workshops held in Islamabad in January and July 1992.
a) Population Data Collection
b) Patient/Client Record Cards
c) Facility Record Keeping System
A serious effort has been undertaken to reduce the existing number of facility-based registers. From an estimated 40 registers, only 19 were approved and developed in a final format.
d) Facility Reports
Most important, the reporting system has been substantially simplified. All programmatic reports have been abolished and replaced by three comprehensive reports:
1. Immediate Reports for Epidemic Disease
2. Monthly Report
3. Yearly Report
e) Report Transmission and Data Processing System
All reports are to be transmitted through regular line management channels from the first level care facility to district, then to division, to province and to federal levels. Timetables for transmission have been agreed upon. All data processing is computerized, in an initial stage at provincial and divisional levels and ultimately also at district level.
f) Feedback Mechanisms
Personnel of first level care facilities (FLCFs) and their supervisors are encouraged to use the data collected on an immediate basis for management of the health facilities.
Priority Health Problems


Under HMIS/FLCF, only priority health problems, as defined in the list of indicators during the 2nd National HMIS Workshop, will be reported to higher levels. In order to have comparable reporting throughout the country, standard definitions for each of these priority health problems have been developed, recording has been standardized, and a coding system has been developed. A special three-digit code has been assigned to each priority health problem; if more specification is needed, the main three-digit code is followed by a fourth digit, separated from the three-digit code by a dot. The priority health problems are listed below.

HMIS/FLCF Code   Health Problem
101              Diarrhea* (for children of less than five years, specify)
101.0            Without dehydration
101.1            With some dehydration
101.2            With severe dehydration
101.9            Dehydration status not specified
102              Dysentery
103              Acute respiratory infections
104              Fever (clinical malaria)
105              Cough for more than two weeks
106              Suspected cholera
107              Suspected meningococcal meningitis
108              Probable poliomyelitis
109              Probable measles
110              Probable/confirmed neonatal tetanus
111              Probable diphtheria
112              Probable whooping cough
113              Goiter
114              Suspected viral hepatitis
115              Suspected AIDS
116              Snake bite with signs and symptoms of poisoning
117              Dog bite
118              Scabies

* Only the subcategories of diarrhea are shown here. The other priority problems also have subcategories, which can be found in the FLCF Manual.

Research Area
With the progress of research in the field of community health, there is increasing awareness of the role of statistics in dealing with questions of survey design, sampling error and significance testing. Computers are now in use in several programs relating to health. Their use in obtaining hospital data, in diagnosis and in other areas has opened a new era in the development of health programs as well as in training. Their use simplifies the statistics and makes the subject interesting to health professionals.

References
1. Kuzma JW, Bohnenblust SE. Basic Statistics for the Health Sciences. 5th ed. New York, McGraw-Hill, 2004.
2. Siddiqui MI, Baig LA, Iliyas M, Ahmed N. Health information and biostatistics. In Iliyas M (Ed). Public Health & Community Medicine. 7th ed. Karachi, Time Publisher, 2006.
3. Larsen LR, Marx ML. An Introduction to Mathematical Statistics and the Application. 2nd ed. New Jersey, Prentice Hall, 1986.
4. Glaser AN. High Yield Biostatistics. New York, John Wiley & Wilkins, 1995.
5. Knapp RG, Miller MC. Clinical Epidemiology and Biostatistics. Baltimore, Williams and Wilkins, 1992.
6. Isaac R. The Pleasure of Probability. New York, Springer-Verlag, 1995.
7. Daniel WW. Applied Non-parametric Statistics. 2nd ed. Boston, PWS-KENT, 1989.
8. Park K. Preventive and Social Medicine. 20th ed. Jabalpur, Banarsidas Bhanot, 2009.
9. Daniel WW. Biostatistics: A Foundation for Analysis in the Health Sciences. 7th ed. New York, John Wiley & Sons, 1999.
10. Knapp RG, Miller MC. Clinical Epidemiology and Biostatistics. Baltimore, Williams & Wilkins, 1992.
11. Park JE. Text Book of Preventive and Social Medicine. 2003.
12. Statistical Report. Ministry of Health, 2003.
13. Mehta CR. The exact analysis of contingency tables in medical research. Statistical Methods in Medical Research 1994; 3: 135-156.
14. Daniel WW. Biostatistics: A Foundation for Analysis in the Health Sciences. 5th ed. New York, John Wiley & Sons, 1991.
15. Department of Medical Education. Research Methodology, Biostatistics and Medical Writing. Karachi, College of Physicians and Surgeons Pakistan, 2006.
    Jamal S. Elements of Statistics. Karachi, Rehber Publishers, 1998.
16. Ross JG, Sundberg EC, Flint KH. Informed consent in school health research: why, how and making easy. Journal of School Health 1999; 69: 171-176.
17. Walpole RE, Myers RH, Myers SL. Probability and Statistics for Engineers and Scientists. 6th ed. Macmillan International, 1993.
18. Chaudhariy SM, Kamal S. Introduction to Statistical Theory. Lahore, Markazi Kutub Khana, 2007.
19. Caswell F. Success in Statistics. London, John Murray, 1982.
20. Iliyas M, Ahmed N. Health information and biostatistics. In Iliyas M (Ed). Community Medicine & Public Health. 4th ed. Karachi, Time Publisher, 1997.

