
Reliability

What do we mean by the reliability of a measuring instrument?

Contrary to its usual meaning in common parlance, in scientific measurement something that is reliable is not necessarily good. (A security guard who falls asleep in the middle of his watch at exactly the same time every night would be a reliable sleeper, but he would undoubtedly be fired for his lack of alertness.) A reliable instrument is one that produces consistent measurements, which may or may not be worth anything. For example, a thermometer that reads 105 degrees Fahrenheit every time it is inserted in the mouth of a child who has no fever is reliable; it is consistent from one insertion to the next, but it is not a very good thermometer because it implies that the child has a high fever when he or she actually does not. The matter of consistency of measurement is not referred to as reliability in all scientific disciplines; some prefer accuracy, precision, agreement, dependability, reproducibility, repeatability, or the term consistency itself.

Reliability is defined as the extent to which a questionnaire, test, observation, or any measurement procedure produces the same results on repeated trials. In short, it is the stability or consistency of scores over time or across raters. Keep in mind that reliability pertains to scores, not people; thus, in research we would never say that a person was reliable. As an example, consider judges in a platform diving competition. The extent to which they agree on the scores for each contestant is an indication of reliability. Similarly, the degree to which an individual's responses (i.e., their scores) on a survey would stay the same over time is also a sign of reliability.

An important point to understand is that a measure can be perfectly reliable and yet not be valid. Consider a bathroom scale that always weighs you as being 5 lbs. heavier than your true weight. This scale (though invalid, as it incorrectly assesses weight) is perfectly reliable, because it consistently weighs you as 5 lbs. heavier than you truly are. A research example of this phenomenon would be a questionnaire designed to assess job satisfaction that asked questions such as "Do you like to watch basketball games?", "Which do you like to eat more, pizza or hamburgers?", and "What is your favorite movie?". As you can readily imagine, the responses to these questions would probably remain stable over time, thus demonstrating highly reliable scores. However, are the questions valid when one is attempting to measure job satisfaction? Of course not, as they have nothing to do with an individual's level of job satisfaction.

Reliability Versus Validity

The scores obtained from a test can be quite reliable, yet not valid. Suppose a researcher gave a group of third-year students two forms of a test designed to measure their knowledge of World History and found their scores to be consistent: those who scored high on Form A also scored high on Form B, those who scored low on A scored low on B, and so on. We would say that the test scores are reliable. But if the researcher then used these same test scores to predict the performance of the students in Mathematics classes, the inference would not be valid. Moreover, if a test yields unreliable scores, we cannot make any valid inferences from them, for they provide no useful information; we cannot determine which scores to use to infer an individual's ability, attitudes, or other characteristics. This is the essential distinction between validity and reliability.

Errors of Measurement

Research requires dependable measurement. Measurements are reliable to the extent that they are repeatable; any random influence that tends to make measurements differ from occasion to occasion or circumstance to circumstance is a source of measurement error. Reliability is the degree to which a test consistently measures whatever it measures (Gay). Errors of measurement that affect reliability are random errors, while errors of measurement that affect validity are systematic or constant errors.

A participant's score on a particular measure consists of two components:

Observed score = True score + Measurement error

A participant's true score is the score that the participant would have obtained if measurement were perfect, i.e., if we were able to measure without error. Measurement error is the component of the observed score that results from factors that distort the score from its true value. Both the true score and the error score are unobserved and must be estimated. The concept of error score is at the heart of reliability, and the goal of good measurement design is to minimize the error component.

Note: In the simple model above, error is assumed to occur randomly. The importance of random error becomes apparent if an assessment is used repeatedly to measure the same individual: the observed score would not be the same on each repeated assessment. In fact, scores are more or less variable depending on the reliability of the assessment instrument. The best estimate of an examinee's true score is the average of the observed scores obtained from repeated measures, and the variability around that mean is the theoretical concept of error, also called error variance. As noted earlier, measurement error can occur in the form of either systematic bias, which bears on construct validity, or random error, which bears on reliability. Random error can never be eliminated completely.

Factors that Influence Measurement Error

Several factors can influence measurement error in test scores. For example, time sampling error is the fluctuation in test scores obtained from repeated testing of the same individual: how an individual performs today may differ from how he or she will perform tomorrow. The list below is not exhaustive, but it covers some possible sources of measurement error:

1) Personality and abilities: these do not change much over time.
2) Learning and maturation: these will change in individuals over time.
3) Intervening experiences: an individual might learn new information between one administration of an assessment and another.
4) Carry-over effect: less time between testings can affect one's test score.
5) Practice effect: skills tend to increase with practice.

Other sources of measurement error:

6) Length of the test: longer tests generally yield more reliable scores.
7) Variability in the group of examinees: generally, the more diversity in the group of test takers, the more reliable the test scores.
8) Degree of difficulty of test items: items should be neither too easy nor too difficult.
9) Situational factors of the research setting (room temperature, lighting, crowding, etc.).

The factors above are typical considerations when constructing tests intended to be reliable. The reliability of a measure is an inverse function of measurement error: the more error, the less reliable the measure, and reliable measures provide consistent measurement from occasion to occasion (a small simulation of this relationship appears below).

How reliable should tests be? There are no absolute standards for what counts as good or bad reliability. For example, if you are evaluating a new drug intended to cure a disease, you may want reliability to be higher than the generally accepted standards. The following guidelines can be used for gauging reliability:

.70 or higher  - acceptable reliability
.60 to .69     - marginally acceptable reliability
Below .60      - unreliable
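To connect the observed-score model above with these guidelines, the following short simulation is offered as an illustration. It is a sketch in Python with NumPy, not part of the original text; the sample size, score scale, and error values are hypothetical.

import numpy as np

# A minimal sketch of the model: observed = true + error.
# Two administrations of a test are simulated for the same examinees to show
# that larger random error lowers the correlation between administrations,
# i.e., lowers test-retest reliability.

rng = np.random.default_rng(0)
n_examinees = 1000
true_scores = rng.normal(loc=50, scale=10, size=n_examinees)  # fixed true scores

for error_sd in (2, 5, 10):
    # each administration adds its own random error to the same true scores
    form_1 = true_scores + rng.normal(0, error_sd, n_examinees)
    form_2 = true_scores + rng.normal(0, error_sd, n_examinees)
    reliability = np.corrcoef(form_1, form_2)[0, 1]
    print(f"error SD = {error_sd:>2}: estimated reliability = {reliability:.2f}")

# Expected pattern: a small error SD gives reliability near 1.0; as the error SD
# approaches the true-score SD, reliability falls toward .50.

The point of the sketch is simply that reliability reflects the proportion of observed-score variance that is true-score variance rather than error variance.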

When a reliability coefficient equals 0, the scores reflect nothing but measurement error.

Reliability Coefficients

Because of factors such as those cited previously, people who take the same test twice will seldom perform exactly the same; this results in errors of measurement. Since errors of measurement are always present to some degree, researchers expect some variation in test scores, and reliability estimates give them an idea of how much variation to expect. Such estimates are known as reliability coefficients.

Four Ways to Obtain a Reliability Coefficient

1. Test-Retest Method (Coefficient of Stability). This involves administering the same test twice to the same group after a certain time interval has elapsed. A reliability coefficient is then calculated to indicate the relationship between the two sets of scores, usually using the Pearson product-moment correlation. Test-retest reliability is the degree to which scores are consistent over time; it indicates the score variation that occurs from one testing session to the next as a result of errors of measurement.

2. Equivalent or Parallel Forms Method (Coefficient of Equivalence). When the equivalent-forms method is used, two different but equivalent forms of a test are administered to the same group of individuals during the same time period. Although the questions are different, they should sample the same content and should be constructed independently of each other. This method is used when it is likely that test takers will recall responses made during the first session and when alternate forms are available. A reliability coefficient is then calculated between the two sets of scores. The problem with this method is that it is difficult to construct two forms of a test or instrument that are essentially equivalent.

3. Equivalence and Stability. When a researcher needs to give a pretest and posttest to assess a change in behavior, reliability of equivalence and stability should be established. In this procedure, reliability data are obtained by administering one form of an instrument to a group of individuals at one time and a second form at a later date. If an instrument has this type of reliability, the researcher can be confident that a change in score across time reflects an actual difference in the trait being measured.

4. Internal-Consistency Methods. There are several internal-consistency methods of estimating reliability; unlike the methods above, they require only a single administration of the test.

a. Split-half procedure. This involves scoring two halves of a test (usually odd items versus even items) separately for each person and then calculating a correlation coefficient between the two sets of scores. The coefficient indicates the degree to which the two halves of the test provide the same results. The reliability of the full-length test is then estimated using the Spearman-Brown prophecy formula:
Reliability of scores on the full test = (2 × reliability of half-test scores) / (1 + reliability of half-test scores)

For demonstration purposes a small data set is employed here: a test of 40 items administered to 10 students. The items are divided into even-numbered (X) and odd-numbered (Y) halves, treated as two simultaneous assessments.

Student   Total (40)   Even X (20)   Odd Y (20)   x = X - 15.2   y = Y - 15.8     x²       y²       xy
A             40            20            20           4.8            4.2        23.04    17.64    20.16
B             28            15            13          -0.2           -2.8         0.04     7.84     0.56
C             35            19            16           3.8            0.2        14.44     0.04     0.76
D             38            18            20           2.8            4.2         7.84    17.64    11.76
E             22            10            12          -5.2           -3.8        27.04    14.44    19.76
F             20            12             8          -3.2           -7.8        10.24    60.84    24.96
G             35            16            19           0.8            3.2         0.64    10.24     2.56
H             33            16            17           0.8            1.2         0.64     1.44     0.96
I             31            12            19          -3.2            3.2        10.24    10.24   -10.24
J             28            14            14          -1.2           -1.8         1.44     3.24     2.16
Mean          31.0          15.2          15.8
SD                          3.26          3.99
Sum                                                                              95.60   143.60    73.40

From this information it is possible to calculate a correlation between the two halves using the Pearson product-moment correlation coefficient, a statistical measure of the degree of relationship between them:

r = Σxy / √(Σx² × Σy²)

where:
x is each student's score on the even-numbered items minus the mean on the even-numbered items;
y is each student's score on the odd-numbered items minus the mean on the odd-numbered items;
N is the number of students;
SD is the standard deviation, computed by (1) squaring the deviation (e.g., x²) for each student, (2) summing the squared deviations (e.g., Σx²), (3) dividing this total by the number of students minus 1 (N − 1), and (4) taking the square root.

The Spearman-Brown formula is usually applied when determining reliability from split halves. It steps the half-test correlation back up to the full number of items, giving a reliability estimate for the test at its original length.
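To make the split-half calculation concrete, the sketch below reproduces the computation for the ten students in the table and then applies the Spearman-Brown step-up. It is an illustrative Python sketch rather than part of the original worked example; the variable names are mine.

import math

# Even-half (X) and odd-half (Y) scores for students A-J, taken from the table above.
even = [20, 15, 19, 18, 10, 12, 16, 16, 12, 14]
odd  = [20, 13, 16, 20, 12,  8, 19, 17, 19, 14]

n = len(even)
mean_x = sum(even) / n
mean_y = sum(odd) / n

# Deviation scores: sums of squares and cross-products.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(even, odd))
sum_x2 = sum((x - mean_x) ** 2 for x in even)
sum_y2 = sum((y - mean_y) ** 2 for y in odd)

# Pearson correlation between the two halves.
r_half = sum_xy / math.sqrt(sum_x2 * sum_y2)

# Spearman-Brown step-up to estimate reliability of the full-length test.
r_full = (2 * r_half) / (1 + r_half)

print(f"half-test correlation    r = {r_half:.3f}")   # about 0.63
print(f"Spearman-Brown estimate    = {r_full:.3f}")   # about 0.77

The half-test correlation works out to roughly .63, which the Spearman-Brown formula steps up to roughly .77 for the full 40-item test, within the acceptable range noted earlier.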

b. Kuder-Richardson Approaches. This is the most frequently used method of determining internal consistency when items are scored dichotomously (right or wrong); in effect it correlates all items on a single test with one another. The reliability coefficient can be determined using the Kuder-Richardson formulas KR-20 and KR-21. KR-21 requires only three pieces of information: K = the number of items in the test, M = the mean, and SD = the standard deviation:
Reliability (KR-21) = [K / (K − 1)] × [1 − M(K − M) / (K × SD²)]

The rationale for Kuder and Richardson's most commonly used procedure is roughly equivalent to:

1) Securing the mean inter-correlation of the number of items (k) in the test, 2) Considering this to be the reliability coefficient for the typical item in the test, 3) Stepping up this average with the Spearman-Brown formula to estimate the reliability coefficient of an assessment of k items.

For example,

Student    Item (k):  1   2   3   4   5   6   7   8   9   10  11  12    X (Score)   x = X − mean     x²
A                     1   1   1   1   1   1   1   0   1   1   1   1        11            4.5        20.25
B                     1   1   1   1   1   1   1   1   0   1   1   0        10            3.5        12.25
C                     1   1   1   1   1   1   1   1   1   0   0   0         9            2.5         6.25
D                     1   1   1   0   1   1   0   1   1   0   0   0         7            0.5         0.25
E                     1   1   1   1   1   0   0   1   1   0   0   0         7            0.5         0.25
F                     1   1   1   0   0   1   1   0   0   1   0   0         6           -0.5         0.25
G                     1   1   1   1   0   0   1   0   0   0   0   0         5           -1.5         2.25
H                     1   1   0   1   0   0   0   1   0   0   0   0         4           -2.5         6.25
I                     1   1   1   0   1   0   0   0   0   0   0   0         4           -2.5         6.25
J                     0   0   0   1   1   0   0   0   0   0   0   0         2           -4.5        20.25
Item total            9   9   8   7   7   5   5   5   4   3   2   1        65             0         74.50
                                                                        mean = 6.5                Σx² = 74.5
(1 = correct, 0 = incorrect)

p value              0.9  0.9  0.8  0.7  0.7  0.5  0.5  0.5  0.4  0.3  0.2  0.1
q value              0.1  0.1  0.2  0.3  0.3  0.5  0.5  0.5  0.6  0.7  0.8  0.9
pq                   0.09 0.09 0.16 0.21 0.21 0.25 0.25 0.25 0.24 0.21 0.16 0.09          Σpq = 2.21

Here,

Variance: σ² = Σx² / (N − 1)

Kuder-Richardson Formula 20: Reliability (KR-20) = [k / (k − 1)] × [1 − Σpq / σ²]

p is the proportion of students passing a given item;
q is the proportion of students who did not pass a given item;
σ² is the variance of the total scores on this assessment;
x is a student's score minus the mean score; each x is squared, the squares are summed (Σx²), and the summed squares are divided by the number of students minus 1 (N − 1) to give the variance;
k is the number of items on the test.

For the example, σ² = 74.5 / 9 ≈ 8.28, so KR-20 = (12/11) × (1 − 2.21/8.28) ≈ .80.
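The worked example can be checked with a short script. The following is an illustrative Python sketch (not part of the original text) that computes KR-20 from the 10-student by 12-item response matrix above, and KR-21 from only the number of items, the mean, and the variance.

# Illustrative sketch: KR-20 and KR-21 for the 10-student, 12-item example above.
# The response matrix is copied from the table; variable names are my own.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],  # A
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0],  # B
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # C
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0],  # D
    [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0],  # E
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],  # F
    [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],  # G
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # H
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # I
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # J
]

n = len(responses)          # number of students
k = len(responses[0])       # number of items

totals = [sum(row) for row in responses]
mean = sum(totals) / n
variance = sum((t - mean) ** 2 for t in totals) / (n - 1)   # Σx² / (N − 1)

# KR-20: uses the item difficulties p (and q = 1 − p).
p = [sum(row[j] for row in responses) / n for j in range(k)]
sum_pq = sum(pj * (1 - pj) for pj in p)
kr20 = (k / (k - 1)) * (1 - sum_pq / variance)

# KR-21: needs only k, the mean, and the variance (it assumes items of equal difficulty).
kr21 = (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

print(f"mean = {mean}, variance = {variance:.2f}, sum(pq) = {sum_pq:.2f}")
print(f"KR-20 = {kr20:.2f}")   # about .80
print(f"KR-21 = {kr21:.2f}")   # lower, about .70, since KR-21 assumes equally difficult items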

c. Alpha Coefficient (Cronbach's alpha). Cronbach's alpha can be used to estimate the average reliability coefficient that would be obtained from all possible split halves. It is an appropriate method for analyzing the reliability of questionnaires that use Likert scales (e.g., strongly disagree, mildly disagree, neutral, mildly agree, strongly agree), since Likert scales give rank-type results.

Cronbach's Alpha Reliability Coefficient in SPSS

The example given here shows how to use SPSS to calculate Cronbach's alpha reliability coefficient. The data for this example are taken from Markland, Emberton and Tallon's (1997) validation study of the Subjective Exercise Experiences Scale (SEES) for use with children. This is a three-factor questionnaire, originally designed by McAuley and Courneya (1994) to measure exercise-induced feeling states, the three factors being psychological well-being, psychological distress and fatigue. Each subscale in the SEES has four items, and respondents are asked to indicate on a 7-point scale the extent to which they are experiencing each feeling at that point in time. For more information on the SEES, see the Factorial Validity section on the measurement pages.

The data can be accessed from the N: drive. Open SPSS, then click on File, then Open. Now browse through the Look in: box to find and click on the file called sees.sav under N:/resmeth. The dialogue box should now look like this:

Click on Open to open the file. Part of the file is shown below. The data comprise 115 children's scores on the twelve items of the SEES. For our example we will calculate Cronbach's alpha for the positive well-being subscale, which comprises scores on the items pwb1, pwb2, pwb3, and pwb4. It should be fairly obvious that the psychological distress items are named pd1 to pd4 and the fatigue items fat1 to fat4, should you want to play with them as well.
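For readers who want to see what SPSS is computing, here is a minimal Python sketch of Cronbach's alpha based on its usual definition, alpha = (k / (k − 1)) × (1 − sum of item variances / variance of total scores). The score matrix below is hypothetical and merely stands in for the pwb1 to pwb4 columns; it is not the SEES data.

import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = item_scores.shape[1]                          # number of items
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of scale totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 7-point responses of six people to four items (stand-ins for pwb1-pwb4).
scores = np.array([
    [5, 6, 5, 6],
    [4, 4, 5, 4],
    [7, 6, 6, 7],
    [3, 3, 4, 3],
    [6, 5, 6, 6],
    [2, 3, 2, 3],
])

print(f"alpha = {cronbach_alpha(scores):.3f}")  # high, since these items track each other closely

The SPSS walkthrough below produces the same statistic for the real pwb items, along with item-level diagnostics.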

To calculate alpha, click on Analyze and then Scale from the drop-down menu, and then Reliability analysis. The following dialogue box will appear:

Now select the variables for analysis from the left-hand box and transfer them using the little arrowhead to the right-hand box. In this case we want pwb1, pwb2, pwb3 and pwb4:

Now click on Statistics in order to choose options for the analysis. Click on the three check boxes under Descriptives for. As you can see, there are lots of other options, but we'll keep it simple for this example.

Now click on Continue to close this box and then click on OK to run the analysis. The SPSS output follows, with an explanation of each part.

RELIABILITY ANALYSIS - SCALE (ALPHA)

              Mean      Std Dev   Cases
1. PWB1       4.8522    1.4464    115.0
2. PWB2       4.7913    1.5700    115.0
3. PWB3       4.6957    1.5285    115.0
4. PWB4       4.7913    1.5922    115.0

RELIABILITY ANALYSIS simply lists the selected variables and gives descriptive statistics, followed by descriptives for the whole scale:

Statistics for SCALE     Mean       Variance    Std Dev   N of Variables
                         19.1304    24.2372     4.9231    4

Item-total Statistics

          Scale Mean      Scale Variance    Corrected Item-      Alpha if Item
          if Item         if Item           Total Correlation    Deleted
          Deleted         Deleted
PWB1      14.2783         15.0272           .6348                .7677
PWB2      14.3391         14.5068           .6075                .7800
PWB3      14.4348         14.7742           .6065                .7800
PWB4      14.3391         13.5945           .6906                .7394

Item-total Statistics gives statistics for relationships between individual items and the whole scale. The important bits for our purposes are the last two columns. Corrected item-total correlations are the correlations between scores on each item and the total scale scores. If the scale is internally consistent you would expect these correlations to be reasonably strong; in this case the correlations are all .6 or more, indicating good consistency. The final column tells us what Cronbach's alpha would be if we deleted an item and re-calculated it on the basis of the remaining three items. We'll come back to this below.

Reliability Coefficients

N of Cases = 115.0    N of Items = 4
Alpha = .8147

Reliability Coefficients gives us the Cronbach's alpha reliability coefficient for the set of four items. At .8147 it indicates good internal consistency. Now, the alpha-if-item-deleted statistics above (in the Item-total Statistics table) show that if we removed any one item, alpha for the remaining three would be worse than alpha for all four items; therefore it is worth retaining all four. If the alpha-if-item-deleted statistics showed that removing an item would lead to an increase in alpha, then we would consider doing that in order to improve the internal consistency of the scale. Try re-running the analysis but including the variable "fat1" and you will see what I mean.

5. Agreement

The coefficient of agreement is established by determining the extent to which two or more persons agree about what they have seen, heard, or rated. This type of reliability is commonly used in observational research and in studies involving performance-based assessments, where professional judgments are made about participants' performance. Two important decisions precede the establishment of a rater agreement percentage:

1. How close do scores by raters have to be to count as "agreement"? With a limited holistic scale (e.g., 1-4 points), it is most likely that you will require exact agreement among raters. If an analytic scale with 30 to 40 points is employed, it may be decided that exact and adjacent scores count as agreement.

2. What percentage of agreement will be acceptable to ensure reliability? An agreement rate of 80% is promoted as a minimum standard, but circumstances relative to the use of the scale may warrant a lower level of acceptance. The choice of an acceptable percentage of agreement must be established by the school or district, and it is advisable that the decision be consultative.

After the definition of agreement and the acceptable percentage of agreement have been established, list the ratings given to each student by each rater for comparison:

Student   Score: Rater 1   Score: Rater 2   Agreement
A         6                6                X
B         5                5                X
C         3                4
D         4                4                X
E         2                3
F         7                7                X
G         6                6                X
H         5                5                X
I         3                4
J         7                7                X
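The agreement count in the table can be reproduced with a short script. This is an illustrative Python sketch using exact agreement only, with the ratings above hard-coded; the manual calculation is described next.

# Illustrative sketch: exact-agreement percentage for two raters.
# Ratings are copied from the table above (students A-J).
rater_1 = [6, 5, 3, 4, 2, 7, 6, 5, 3, 7]
rater_2 = [6, 5, 4, 4, 3, 7, 6, 5, 4, 7]

agreements = sum(1 for a, b in zip(rater_1, rater_2) if a == b)
percent = 100 * agreements / len(rater_1)

print(f"{agreements} of {len(rater_1)} ratings agree -> {percent:.0f}% agreement")
# Expected output: 7 of 10 ratings agree -> 70% agreement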

Dividing the number of cases where the student scores given by the raters are in agreement (7) by the total number of cases (10) gives the rater agreement percentage (70%). When there are more than two raters, the consistency of ratings can be calculated for two raters at a time with the same method. For example, if three teachers are employed as raters, rater agreement percentages should be calculated for:

Rater 1 and Rater 2
Rater 1 and Rater 3
Rater 2 and Rater 3

All calculations should exceed the acceptable reliability score. If there is occasion to use more than two raters for the same assessment performance or product, an analysis of variance using the raters as the independent variable can be computed using the sums of squares.

Interpretation of Reliability Coefficients

1. The more heterogeneous the group is on the trait being measured, the higher the reliability.
2. The more items there are in an instrument, the higher the reliability.
3. The greater the range of scores, the higher the reliability.
4. An achievement test of medium difficulty will yield higher reliability than either a very hard or a very easy test.

5. Reliability, like validity, when based on a norming group, is demonstrated only for subjects whose characteristics are similar to those of the norming group.
6. The more the items discriminate between high and low achievers, the greater the reliability.

Remember that our ability to answer a research question is only as good as the instruments and data collection procedures we develop. Well-trained and motivated observers or a well-developed survey instrument will provide better-quality data with which to answer a question or solve a problem. Finally, we must be aware that reliability is necessary but not sufficient for validity: for a measure to be valid it must be reliable, but it must also measure what it is intended to measure.

References

Berk, R. (1979). Generalizability of behavioral observations: A clarification of interobserver agreement and interobserver reliability. American Journal of Mental Deficiency, 83(5), 460-472.

Carmines, E., & Zeller, R. (1979). Reliability and validity assessment. Beverly Hills, CA: Sage Publications.

Cronbach, L. (1990). Essentials of psychological testing. New York: Harper & Row.

Gay, L. (1987). Educational research: Competencies for analysis and application. Columbus, OH: Merrill Publishing Co.

Guilford, J. (1954). Psychometric methods. New York: McGraw-Hill.

Nunnally, J. (1978). Psychometric theory. New York: McGraw-Hill.

Winer, B., Brown, D., & Michels, K. (1991). Statistical principles in experimental design (3rd ed.). New York: McGraw-Hill.
