

Simple statistical approaches to reliability and item analysis


Data and non-SPSS programs mentioned are on my website
Note that this account is not fully revised for 2012
1. Stats used to check on one's instruments, not to analyse substantive results of a study
Reliability measurement (Rel) and item analysis (IA) are things you do 'on the side' in an
investigation. They are not usually the main focus of the research (unless it is research on testing
itself), but rather ways of checking on the quality of whatever means you use to measure the
relevant variables for the research - whether it be reading proficiency, informants' social class,
learners' mastery of the count/mass distinction or whatever. Along with Rel and IA also belongs
what I call 'case analysis', though this is not a widespread term for it (it will be explained later).
These are not the only things to check on. The other big one is validity (see separate handout):
just to confuse things, though, some people nowadays think of reliability as one form of validity.
The things just mentioned are often checked as part of a pilot study, where means of
measurement or 'instruments' are tried out small scale before being used for real in 'the main
study'. They then form the basis for revising the test, scoring system, instructions to raters, or
whatever ready for the main study.
Analysis of Rel and IA may also be done in a main study, and reported in a separate section from
the substantive 'results'. However, if the reliability of the instruments is not good, at that stage
there is little one can do about it, and the real results one is interested in will be less credible. If
you are aiming to show that motivation affects people's success in language learning, say, you
will not convince anyone if the rating scales, tests or whatever you use to measure 'motivation'
and 'language learning' are shown not to give consistent results in the first place.
The basic statistics for examining reliability and doing item analysis are mostly just special uses
of general stats used for other things as well. We have seen many of them already. However,
here the focus is on comparing means of measurement with themselves, rather than comparing
different groups of people, different conditions etc. (More complex reliability stats are described
in Rietveld and van Hout, and Dunn).
2. Introduction to reliability.
Some means of measuring or quantifying language is said to be, in the technical sense, 'reliable',
if it gives virtually the same scores, rank positions or categorisations to the same cases when
applied repeatedly in the same way in the same conditions by the same or a comparable measurer.
Of course that says nothing about whether the test, questionnaire item, observation schedule or
whatever quantifies what you want it to quantify. That is a matter of 'validity'. But if a measuring
technique is reliable, you can at least be confident that when used on any occasion it is recording
more or less the "true" score/categorisation of each individual measured for whatever variable it
measures. So-called 'random errors' or 'measurement errors' are minimal - there is relatively little
misclassification, or awarding slightly higher or lower marks than should be on any occasion.
It must be noted that what constitutes "the same score" etc. in the above characterisation differs
depending on whether quantification is in the absolute or relative sense - see below. Indeed some
prefer to use the term 'dependability' rather than 'reliability' when considering the present matters
for data scored in some absolute way. This account attempts to cover both, but will stick to the
term 'reliability' for all.
Details of possible causes of unreliability and how to eliminate them are not discussed here (see
for example Scholfield, 1995: chs 19, 20). Rather we focus on the statistical measurement of the
reliability of any measuring instrument or technique. This would of course be used as a guide to
revise some aspect of the measurement method, test, classification procedure etc. to reduce
unreliability in future, or at least taken into account when interpreting any scores obtained: these
matters are not pursued here either.
The simple approaches to quantifying reliability reviewed here all require two or more parallel
sets of scores/categorisations of the same cases, obtained typically in one of the following ways.
I.e. these are the fundamental DESIGNS of reliability study. (Note: IA applies only in situation
D, 'Case analysis' can be done in all of A-D).
(A) Two or more different scorers, researchers, raters etc. ('judges'), of the same general type,
score/rank/categorise the same cases (people, sentences etc.). For instance the same set of English compositions
obtained from a group of learners are double marked (blindly) by competent teachers, or ones trained by the
researcher on the same scoring system. The coefficient resulting from analysis of the two or more sets of scores
etc. is regarded as measuring 'interjudge /inter-rater reliability'.
(B) The same judge rescores/recategorises the same cases on two or more occasions, using the same primary data.
For example he/she re-marks the same set of English compositions, after a suitable time gap so that he/she has
forgotten specific marks given. The coefficient resulting from analysis of the two or more sets of scores etc. is
regarded as measuring 'intrajudge reliability'.
(C) The same cases are measured/categorised on two or more successive occasions by the same measurer, using
either the same or an equivalent test or other instrument repeatedly. For instance, the MLU of some children is
measured from speech samples elicited in the same way on two successive days. Or learners take three equivalent
40 item vocab size tests one after the other instead of one long one. The coefficient resulting from analysis of the
two or more sets of scores etc. is regarded as measuring 'test-retest reliability' or 'stability'.
(D) Scores are obtained on just one occasion from one group of cases by one measurer, using some instrument
where an overall score is calculated for each case from a lot of items scored separately, like a multi-item test or an
attitude inventory with a gamut of agree/disagree items. There is then internal analysis of the scores: typically two
scores are obtained for each case by dividing the set of items into two and calculating a score on each half. (Note
that this is not dividing the group of cases in half and comparing scores between halves in that sense!). The
coefficient resulting from analysis of the two sets of scores is regarded as measuring 'internal consistency'
reliability.
Each of these in effect focusses on a different aspect/cause of unreliability - it is up to the
researcher to decide which aspect(s) are the ones most in need of checking in a particular study.
For instance if the measurement is by a test with straightforward correct/wrong items, it may be
assumed that reliability as per A and B may not be a problem, but D may be worth investigating.
If the measurement involves coding of strategies in transcripts of think aloud protocols, then A
and B may be more at issue, as even experts are very likely to disagree on what is an instance of
this or that strategy.
In all these approaches the cases should be of the same sort, a homogeneous group, and probably
at the very least 20 in number to give a safe estimate of reliability. They should be the same sort
of people as the instrument is to be used on 'for real' later. All 'facets of observation' are in
principle kept constant or are random alternatives of the same type. By that is meant that
rescoring, retesting etc. is done in the same or equivalent conditions (room, time of day etc.) with
the same instructions etc. Either the same test or whatever is used repeatedly, or 'equivalent'
parallel forms. The aim is not to compare even slightly different means of quantification, and
certainly not primarily to compare different types of people or different conditions: focus is on
analysing the instrument itself. There are numerous problems with keeping things genuinely
'identical' when measuring the same people twice and so forth that cannot be pursued here (see
for example Scholfield, 1995 ch 19, 20; Bachman, 1990, ch 6).
More complicated reliability studies are often combinations of the above. For example, the same
cases are measured on several occasions and with several measurers. Further elaborations can
involve varying substantive factors usually held constant. For instance one can remeasure more
than one group of cases, to see if reliability is different when different types of case are involved.
Similarly one can vary the type of measurer (e.g. teacher versus researcher), the precise
conditions of testing etc. to see their effects on reliability. We stick with simple approaches here.
The simple approaches to quantifying reliability all yield two or more parallel sets of figures for
the same cases. I.e. the data looks like that used in repeated measures and correlation studies we
have met before. Sometimes, especially with interjudge reliability, there are three or more sets of
scores/categorisations to examine. To obtain a coefficient of reliability, the sets have to be
analysed statistically for closeness or agreement in some way. The reliability measures are then
interpreted in part depending on how the repeated scores were obtained - from repeated judges or
occasions or what - and in part on what statistical measure of agreement or the like was used
(sketched here).
Some of the reliability measures used are ones we have met before (like the Pearson r correlation
coefficient), others not. But either way, you need to think a bit differently in reliability studies
from the way you do in analysis of general research results. In particular, significance tests and p
values are not of very great interest in reliability work and are not usually quoted. This is because
when you are comparing a measurement 'with itself' in the above sorts of ways (A-D) it would be
remarkable to fail to get a significant amount of agreement. What is crucial is not just significant
amounts of agreement or relationship or whatever, since 'significant' means no more than
'definitely better than zero'. To show a measurement is reliable you want high actual levels of
agreement or relationship etc. - e.g. Pearson correlations close to +1. In short, in reliability work
you mainly use just graphs and relevant descriptive statistics to show what is going on.
The choice of appropriate statistical measure of reliability depends on a number of things,
especially:
(1) whether two or more than two sets of repeated scores or categorisations have to be
considered (arising in designs A-C above), or an internal analysis of one set (from design
D above)
(2) the scale type of the variable (interval, rank order, ordered categories, nominal
categories).
(3) whether the scores/categorisations are considered as having relative or absolute value.
These three considerations are used to organise the overview of reliability coefficients below.
Scale type is essentially the matter of whether the figures obtained have been derived in such a
way that they have to be regarded as 'interval' or 'ordinal' or labelling a 'nominal' classification.
Typically scores from a test or MLUs or error counts and often rating figures will be regarded as
interval, i.e. having the full properties we expect of numbers. If cases are given numbers
indicating just who is better than who, but not by how much (rank position), then that is ordinal
data. If one assigns cases to ordered categories, such as social classes A, B1 etc., that too is
ordinal. If numbers just label categories with no order involved, as when one assigns errors to
five different types (grammatical, lexical, spelling etc.), or classifies people as having either cleft
palate or normal palate, then that is a nominal categorisation. (For a fuller account see Scholfield,
1995: chs 11-18).
The relative/absolute distinction is particularly important since in everyday parlance when we
talk about the "closeness" or "amount of agreement" of two sets of scores or categorisations we
probably think first of some absolute correspondence. However, the most popular reliability
coefficients cited in studies are based on treating scores as relative: essentially they are those we
met in the two variable correlational design - measures of symmetric relationship of the highest
type the scale allows - preferably linear.
In a nutshell, absolute measurement is done with an eye to some predecided standards of
language ability or performance which decide what score a case gets, or what category they are
placed in. Relative quantification on the other hand is done more with an eye to the ability or
performance of other cases measured, relative to which an individual gets a score or is placed in
a category. Thus an absolute, 'criterion-referenced', test of receptive knowledge of English
phrasal verbs would aim to quantify how many phrasal verbs each case actually understands, as a
proportion of all such verbs. A relative, 'norm-referenced', test of the same thing would be
concerned rather with discriminating which cases know more phrasal verbs than which other
cases (and not particularly quantify how many anyone actually knows).
In general, interval scored data may be either relative or absolute, as may data in ordered
categories. One must think hard to see which such data is best thought of as trying to be, given
the type of instrument used. On the other hand rank ordering can only really be relative and
nominal categorisation only absolute. If unfamiliar, the difference will become clearer as its
consequences are seen below. (For further elucidation see also Scholfield, 1995: ch 10).
If we think just of interval scores, we can see the essential consequences for measuring reliability
this way. With the absolute or criterion-referenced view of the scores, we would say
measurement is reliable if on the two occasions or with two scorers or whatever each case gets
almost exactly the same score. For instance, someone who scores 75% with one scorer, scores
76% with the other. Statistically this is a matter of 'similarity' or 'difference' or 'proximity'.
With the relative/norm-referenced/psychometric view of scores we would say measurement is
reliable if on the two occasions, or with two scorers or whatever, each case is placed in more or
less the same position relative to the scores of the group. For instance, someone who is just
above average, and in 14th place from the top in a group of 35 with one scorer comes out almost
the same amount above another scorer's average and in 15th place with them. Statistically this is
(in simple approaches) a matter of 'correlation'.
The two views are not identical, because in the relative version the actual scores awarded to the
same case by two scorers or whatever could be more different. Reliability might still be regarded
as high provided the scores of the group had all shifted in parallel up or down between the two
occasions or scorers.
We now attempt an overview of simple statistical means of quantifying reliability.
RELIABILITY IN SITUATIONS A-C
3a. Two occasions or judges: data viewed as interval with relative value
Main stats: scatterplot, Pearson r.
The following simple sets of data will enable us to see what the most commonly cited measure of
reliability, the 'Pearson r correlation coefficient' actually tells us.
Each of the following sets of figures is the imagined results for four subjects measured or marked
twice on the same measure (reliability design C), or by two markers (A). The sets are
deliberately ridiculously small both so as to be quick to enter in SPSS and so as to be especially
easy to see by eye how similar the two sets of results are.
For each set
(a) enter the figures in the Data grid
(b) get the Pearson r correlation coefficient calculated
(c) get a scatterplot made, with suitable labels for the axes. In order to be able to see
which case is which on the scatterplot, it is also useful to edit it to label the squares with
case numbers: Edit the graph: Chart.... options... Case labels... set them to On.
(d) answer the questions below
If not sure how to do a-c, look back esp. at LG475 SPSS tasks for step by step instructions.
Set 1
First time Second time
Subject 1 13 13
Subject 2 18 18
Subject 3 16 16
Subject 4 21 21
Does this look like a reliable measure? Are the columns of results 'agreeing' closely?
Does the value of Pearson r reflect your intuition?
If not, why not?
Does the scatterplot reflect your intuition too?
How does the scatterplot relate to the value of r?
Set 2
First time Second time
Subject 1 13 20
Subject 2 16 22
Subject 3 18 23
Subject 4 21 25
Does this look like a reliable measure? Are the columns of results 'agreeing' closely?
Does the value of Pearson r reflect your intuition?
If not, why not?
Does the scatterplot reflect your intuition either?
How does the scatterplot relate to the value of r?
Set 3
First time Second time
Subject 1 13 13
Subject 2 16 16
Subject 3 18 21
Subject 4 21 18
How does this set differ from sets 1 and 2?
Intuitively is agreement better or worse than set 2 do you think?
What is r for this, and what does it reflect? Does it show this set has higher or lower
reliability than set 2?
Examine and comment on the scatterplot.
Set 4
First time Second time
Subject 1 13 13
Subject 2 16 14
Subject 3 18 20
Subject 4 21 21
How does this set differ from the ones above?
Which set above is it most like?
What r do you expect to get?
Do you get it?
Look at the scatterplot and see why. r is obviously sensitive to .... what aspect of the
scores?
Set 5
First time Second time
Subject 1 12 12
Subject 2 16 16
Subject 3 19 22
Subject 4 22 19
How does this set differ from the ones above (esp. set 3)?
Which of them is it most like?
What r do you expect to get?
Do you get it?
What might be causing the result?
From this you should get a feel for what sort of 'reliability' the Pearson r measures:
Perfect reliability: what does r = ?
Good reliability: r should be greater than what, do you think?
Is it true that r is greater if the corresponding scores in the two sets are close to each other
in value?
Is it true that r is higher if scores are spread farther apart within the sets? (i.e. greater SD
and variance)
Is it true that r is greater if the two sets of scores are in the same order?
Do you think reliability measured this way is satisfactory for most research? Why?
Is r as in 'Pearson r' short for 'reliability'?
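For those who want to check their answers outside SPSS, here is a minimal sketch in Python (NumPy/SciPy assumed to be installed - this is not part of the SPSS procedure) that computes Pearson r for the five toy sets above.

from scipy.stats import pearsonr

# The five toy data sets from above: (first time, second time) for four subjects.
sets = {
    "Set 1": ([13, 18, 16, 21], [13, 18, 16, 21]),
    "Set 2": ([13, 16, 18, 21], [20, 22, 23, 25]),
    "Set 3": ([13, 16, 18, 21], [13, 16, 21, 18]),
    "Set 4": ([13, 16, 18, 21], [13, 14, 20, 21]),
    "Set 5": ([12, 16, 19, 22], [12, 16, 22, 19]),
}

for name, (first, second) in sets.items():
    r, p = pearsonr(first, second)
    # The size of r is what matters for reliability; p is of little interest here.
    print(f"{name}: r = {r:.3f}")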
3b. Two occasions or judges: data viewed as interval with absolute value
Main stats: histogram of absolute differences for cases, mean absolute difference.
You may feel that for some purposes the sort of similarity of relative order and distance that r
reflects is not the sort of 'sameness' you want to measure as an indication of reliability. If people
are being graded on a number scale from 'native speaker proficiency' = 20 to 'complete beginner'
= 0, then you would want people to get the same actual number grade, whoever measured them,
if the measure is reliable (not just be placed in the same order). Then you need a different
measure of agreement.
How could you measure how close to being identical in some absolute sense two columns
of scores are?
Could you just subtract each of one column of scores from the corresponding one in the
other column and add the differences?
Try doing this by hand for the above sets of data and see if it works.
In fact to get a useful absolute agreement measure you have to add a proviso to how you
calculate and add the differences. You need the 'mean absolute difference' measure of reliability.
See if you can get SPSS to calculate it for you for one of the above sets of data. Set 4 is a good
one to try.
To go about it you click the Transform menu and choose Compute. You are offered a screen
where you can generate a new column of figures (Target Variable) by formula from ones you
have already. In this instance let's call the Target Variable diff. Click your existing columns into
the Numerical expression box top right and connect them using ABS from the Functions list.
Use a formula such as ABS(col1 - col2)
Having got your new column, look at it in the data editor window
How have the differences been calculated? Check a few to see they are right.
How does the 'absolute difference' between scores differ from just the 'difference'?
Get the histogram and descriptive stats for that new difference column.
A bigger mean absolute difference means ... greater or less reliability?
Does this mean, as a reliability measure, have a logical maximum and minimum value?
Can you see how the histogram would enable you to spot the cases that were most
disagreed about?
What shape would you expect the histogram to have: not the 'Normal' distribution
shape.... Why?
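If you would rather check the arithmetic outside SPSS, here is a minimal sketch in Python (pandas assumed installed; the figures are toy Set 4 from above) that mirrors the Transform... Compute step: build the absolute-difference column, then get its mean and a crude frequency count.

import pandas as pd

# Toy Set 4 from above: the same four subjects scored twice.
data = pd.DataFrame({"first": [13, 16, 18, 21],
                     "second": [13, 14, 20, 21]})

# Equivalent of the SPSS formula ABS(col1 - col2): one absolute difference per case.
data["diff"] = (data["first"] - data["second"]).abs()

print(data)
print("Mean absolute difference:", data["diff"].mean())
print(data["diff"].value_counts().sort_index())   # crude 'histogram' of the differences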
You may have had enough of this toy data by now. So look at a set of real data. I have given the
data not as an SPSS data file, ending in .sav, but a plain text file agjudg2.dat where the data is
just wordprocessed in columns with spaces between.
Sidepoint. Opening a file composed on wordprocessor or the like, and saved as plain text/DOS
text, as a file with the ending .txt or .dat.
Simplest way is to use the menu option File... Read text data. Click on the arrow beside the
Files of type space and browse for files ending in .txt or .dat.
When you choose the file, the Text import wizard should be activated, and you can go through
it probably without altering any of its choices to get the data loaded satisfactorily.
This illustrates how you can get data into SPSS without retyping it even if you have it only
originally in wordprocessed form, or scanned in from a book, or as output from some other
software that saves data as plain text files
Two markers, P and V, marked a set of 45 undergrad applied linguistics exam
scripts without seeing each others' marks. What is the reliability of their marking?
Which reliability design is involved (A-D)?
Get the data from my disk - file agjudg2.dat and calculate both the relative and
absolute types of reliability coefficient as described above. Get the scatterplot, with
cases labelled (when editing the graph, use Chart... options... case labels... On),
and the histogram of absolute differences.
Interpret them
Is the reliability good on either the relative or absolute measure?
Remember the max and min values possible for Pearson r are ....what?
The max and min possible for the mean absolute difference depend on the length of
the scale available.
If DLL in practice gives marks only effectively between 35 and 80, what are
the max and min possible mean absolute difference to compare the obtained
value with? Which end is 'good' i.e. to show high absolute reliability do we
want a mean abs difference that is small or large?
See how the scatterplot shows which students the markers disagreed on most.
This is what I call 'case analysis': looking at the individual people or whatever who
seem to be off-line (outliers) causing disagreement and trying to explain why (and
in research perhaps eliminating them from the sample if there is good reason).
Do the same cases come up as 'odd' on the histogram as on the scatterplot? If
not why not?
Would you say university exam marking is supposed to be on a relative or
absolute scale?
So which reliability coefficient is more worth looking at in this instance?
Does it inspire confidence in the double marking system? Think carefully.
Can you get SPSS to tell you which marker was more 'generous'?
Absolute reliability is rarely reported by researchers: but I think that is because all the stats books
are fixated on relative measures like Pearson r. The mean absolute difference does not have a
maximum of +1 meaning 'perfect reliability', of course, but can be turned into such a measure by
putting it in the formula:
Absolute measure on scale 0 to 1 = 1 - (Mean abs. diff. / Length of interval scale)
Check that makes sense for the ones we have used above.
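A minimal sketch of that conversion (Python again; the mean absolute difference is the one from toy Set 4 above, and the scale length of 25 is an assumption for illustration - put in whatever your own marking scale actually allows):

# Turning a mean absolute difference into a coefficient on a 0-to-1 scale.
mean_abs_diff = 1.0    # e.g. the value obtained for toy Set 4 above
scale_length = 25      # assumed length of the marking scale - an illustration only

absolute_coefficient = 1 - (mean_abs_diff / scale_length)
print(absolute_coefficient)   # close to 1 = high absolute agreement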
3c. Two occasions or judges: data rank ordered with relative value
Much as for 3a, using the Spearman rho coefficient instead of Pearson r.
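As a quick sketch (Python with SciPy assumed; toy Set 3 reused just for illustration):

from scipy.stats import spearmanr

first = [13, 16, 18, 21]
second = [13, 16, 21, 18]

rho, p = spearmanr(first, second)
print(f"rho = {rho:.3f}")   # as with Pearson r, the size of rho matters, not p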
4a. Three or more occasions or judges: interval data with relative value.
Main stats: scatterplots, Cronbach's alpha.
It is not common except in professional testers' reliability studies to obtain more than two sets of
scores from supposedly identical quantification occasions to compare. However, quite commonly
it is both easy and useful to use more than two judges of the same sort to score or rate the same
cases. You can of course use the relevant previously described approaches, both visual and
numerical, to look at the correlation between each pair of occasions or judges separately, but
ideally you would like an overall measure of the amount of agreement as an indication of
collective reliability. Hatch and Lazaraton p533 give a method that relies on averaging the
correlations between pairs of judges.
A standard solution is Cronbach's alpha (Greek letter α), which can be thought of as a bit like a
Pearson correlation coefficient across more than two occasions or judges of the same things. Its
maximum is +1, but it can achieve peculiar negative values (theoretically of any size) in some
odd situations. It has similar properties to Pearson r in that it is not so much affected by the
correspondence of absolute sizes of the scores, more by the agreement in order and spread of
them between particular columns (=occasions or judges). SPSS also produces useful statistics
confusingly called 'alpha if item deleted' (For 'item' read 'occasion or judge' in the present
discussion). This tells you what the alpha would be if a particular judge was left out and alpha
just calculated from the others. Obviously if the alpha goes up drastically when a judge is left out,
that suggests that that judge was not rating very much in the same way as the others, but is
somewhat of a maverick. One might not trust that judge in future (unless of course that one
turned out to be you, the researcher!).
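If you want to see what alpha and 'alpha if item deleted' are actually doing, here is a minimal sketch in Python (NumPy assumed; the ratings below are made up, not the agjudg3.dat data) using the usual formula alpha = k/(k-1) x (1 - sum of item variances / variance of totals).

import numpy as np

def cronbach_alpha(scores):
    # scores: one row per case, one column per judge (or item).
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of judges/items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each column
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the case totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Made-up ratings: 5 compositions rated by 3 judges.
ratings = np.array([[12, 14, 13],
                    [15, 15, 16],
                    [ 9, 10, 11],
                    [17, 16, 18],
                    [11, 13, 12]])

print("alpha =", round(cronbach_alpha(ratings), 3))

# 'Alpha if item (here, judge) deleted': recompute leaving each judge out in turn.
for j in range(ratings.shape[1]):
    reduced = np.delete(ratings, j, axis=1)
    print(f"alpha if judge {j + 1} deleted =", round(cronbach_alpha(reduced), 3))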
To try out an analysis with alpha, get hold of my data agjudg3.dat (again this is not
an SPSS file). This consists of ratings by three judges of 14 learner compositions
(design A again).
The three markers clearly did not use the same scale to mark on, so whether or not
each was marking on an absolute scale, we can only treat their marks as relative
scores for further analysis. The reliability question that alpha settles is, 'To what
extent are the three markers' scores mutually correlating positively?'
First get the visual picture by making scatterplots for each pair of markers as
before.
Three markers, so how many scatterplots needed to see all possible pairs?
By inspection of the scatterplots, which pair of markers agrees better? Which
marker is the odd one out?
As usual look for cases that are off line - compositions most disagreed about.
Get the correlation calculated between each pair of judges. When you ask for this
in SPSS just put all three variables in at once and practise spotting the informative
bits of the 'correlation matrix' which appears in the output (remember SPSS gives
you everything twice in this, plus other rubbish!).
Now go to Analyze...Scale... Reliability and under Statistics get Scale if item
deleted. In the output remember that 'items' means 'judges' in this instance.
How high is alpha? Is it convincingly showing high agreement of judges and
so interjudge reliability? Remember that reliability coefficients of the
correlation type need to be high in absolute terms (over .8) to be impressive.
Which judge comes out as the rogue one? The same as we saw from the
correlations?
4b. Three or more occasions or judges: rank order data with relative value.
Main stats: scatterplots, Kendall's W coefficient of concordance.
A very useful simple measure is 'Kendall's W', also called 'Kendall's coefficient of Concordance'.
This is actually a technique for quantifying the amount of agreement in several rank orderings of
the same cases, though it can also be used for interval data like that in 4a above (with loss of
some information). Often it is used where one doubts if interval data really is equal interval.
You could use this as a cautious choice where cases have been given several scores that might be
regarded as interval, say by several teachers rating the same essays in a non-absolute way, but
where you feel doubt as to whether they really observed equal difference between rating levels.
In their marking, is an essay rated as worth 11 different from one rated 10 by the same amount of
quality as one rated 4 is different from one rated 3, for example? Often one cannot be sure in
instances when the figures come from people rather than from instruments like computers
measuring response times. One opts to treat it as an interval scale or not on one's own judgment.
With the W measure, the data is treated as having been rank ordered only by each judge, and
some information is lost.
Kendall's W is like a Spearman rho correlation for three or more rank orderings, and comes out
between 0 and 1, with 1 indicating perfect 'agreement' (see e.g. Cohen and Holliday, 1982 for
calculation). Parallel to what has been said already, if three scorers differ in some way in terms
of actual level of score, e.g. A always gives higher marks than B and B higher than C, this will
not be reflected in such an interjudge reliability coefficient. That is quite in order for a norm-
referenced technique but of course is not really satisfactory for a criterion-referenced one. For
instance Fischer (1984) got ten untrained raters to use his communicative rating measure on 18
learner texts and reports a W of .725. This indicates a fair level of agreement in the order of
merit the raters assigned to the texts, but says nothing about whether they agreed much in the
actual level of communicativeness they attributed to each.
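Kendall's W can also be computed directly from the usual formula, W = 12S / (m²(n³ - n)), where S is the sum of squared deviations of the case rank sums from their mean, m the number of judges and n the number of cases. A minimal sketch (Python with NumPy/SciPy assumed; made-up ratings; no correction for ties):

import numpy as np
from scipy.stats import rankdata

def kendalls_w(scores):
    # scores: one row per case, one column per judge; no tie correction applied.
    scores = np.asarray(scores, dtype=float)
    n, m = scores.shape
    ranks = np.column_stack([rankdata(scores[:, j]) for j in range(m)])
    rank_sums = ranks.sum(axis=1)                     # R_i for each case
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Made-up ratings: 5 compositions rated by 3 judges.
ratings = np.array([[12, 14, 13],
                    [15, 15, 16],
                    [ 9, 10, 11],
                    [17, 16, 18],
                    [11, 13, 12]])

print("W =", round(kendalls_w(ratings), 3))   # 1 = perfect agreement in rank order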
To try out an analysis with W, use my data agjudg3.dat again.
The reliability question that Kendall's W answers is 'To what extent are the three
markers putting the compositions in the same order of merit?'
We have already analysed the scatterplots, so go direct to get Kendall's W
calculated with SPSS to get the overall reliability of the three markers together.
There is a slight problem here. One would normally have the data entered as in my
file, with the 14 cases measured (the compositions) down the side forming the
rows, and the three judges across the top, one per column, but this is not the way
SPSS requires it. SPSS will calculate a result from this, but it will be the wrong
one. To get it done correctly, you must first make the rows the columns and vice
versa.
To do this is easy in SPSS. Click the Data menu and then choose Transpose.
Complete the dialog box by getting the three variables into the right hand space and
your data will get rearranged to be three rows and 14 columns, plus an extra
column put in automatically labelling the three judges. The columns will be
labelled with labels like var0001 in the usual default manner.
From now on SPSS will refer to the three judges as the cases and the 14
compositions as variables, which is confusing, but there you are....
Now go to the Analyze menu ....Nonparametric ....K related samples ....and click
the box to activate Kendall's W. Click to deactivate Friedman's test. Make sure all
14 'variables' are included.
From the output you will get a value of W
How high is it on a scale 0 to 1? Similar or not to alpha?
Is it convincingly showing high agreement of judges and so interjudge
reliability?
Remember that reliability coefficients on a scale 0-1 need to be quite high in
absolute terms (over .8) to be really impressive.
You will also get a level of significance.
What is it?
Does it show the value of W is significant?
Why do you think that in reliability study people don't pay much attention to
significance levels (p values)?
Look at the Mean Rank part of the display. Remember the statistical
procedure has simply turned each judge's marks into a rank ordering of the
14 students' compositions. SPSS is displaying here the average ranking of
each composition over the three judges. You can do some useful 'case
analysis' (on the compositions, currently being called variables by SPSS, remember):
Which composition is being judged collectively best, which worst? (Does
SPSS rank from 1=best to 14 = worst or the reverse?)
If you have them, look at the actual compositions and see if you are surprised.
Where are the judges having trouble agreeing on a clear difference of
quality? Look for pairs of compositions where the rankings are tied.
4c. Three or more occasions or judges: data viewed as interval with absolute value
Main stats: Histogram of individual SDs of cases, Mean SD
For interval measurement with some absolute value, such as criterion-referenced tests and many
counts and ratings, alpha or Kendall's W is not so suitable. Since the actual scores "mean"
something, to be reliable, you'd want any such measurement to yield the same actual marks not
just marks in the same order, when different testers/raters etc. are involved. You can of course
use the method of 3b repeatedly. Or a simple approach to reliability here is to calculate a
standard measure of spread, such as the 'standard deviation' (SD), to quantify the closeness of the
three or more scores of each case. SD is described in any elementary statistics text. The average
SD over the group of cases measured then serves as an absolute 'difference' measure of reliability.
You can do this following the procedure for making a column of absolute differences: except
you ask for the SD of the set of columns instead of ABS in the formula.
As for the abs. differences in 3b, we can construct a frequency histogram and see what is the
commonest SD, and the largest SD - i.e. the most unreliable case - and so on. Mean SD comes
out as zero if the three or more scores of each case are the same, albeit the scores are different for
different cases. A higher SD figure reflects lower agreement between the repeated scores. The
logical maximum average SD, signalling maximum difference, is half the length of the score
scale (except for some rare situations where it can be a little greater). So it is 50 for scores in
percent, 3 for a 1-to-7 rating scale (because 7 minus 1, divided by 2, equals 3). The possible
maximum must always be borne in mind, since an average SD of, say, 2.5 will signal quite a
different degree of unreliability, in this absolute sense, depending on it. Like at the end of 3b we
could re-express a mean SD as a coefficient with maximum value 1.
You can do an absolute reliability analysis like the above quickly using my ABSREL2 program.
You save your data as Fixed ASCII from SPSS, in a file with suffix .dat. Get my program from
me or the website. It is a clunky old executable file that runs in DOS. Run it to get the full
instructions; follow them and look at its output file. It does a few things that it takes time to get
SPSS to do. It identifies the rogue judges and cases automatically.
Note, sometimes the means of the three or more repeated tests or measurements are presented as
evidence of this sort of reliability. However, as the examples below show, a test on three
occasions can have the same mean score but still not be reliable in this absolute sense. Imagine
three subjects have their vocab size measured by equivalent tests on three occasions. The sizes
the test gives for their vocabularies are as follows.
Example 1
Occasion 1 Occasion 2 Occasion 3 SD
S1 200 200 200 0
S2 400 400 400 0
S3 600 600 600 0
Mean 400 400 400 0
Here all subjects score the same on each occasion/test, so absolute reliability is perfect. This is
reflected in the fact that the SDs are all zero so mean SD is zero. The mean of each test is also
identical.
Example 2
Occasion 1 Occasion 2 Occasion 3 SD
S1 200 200 200 0
S2 500 600 700 100
S3 500 400 300 100
Mean 400 400 400 66.7
Here all subjects do not score the same on each test, so absolute reliability is not perfect. This is
reflected in the fact that the SDs are not all zero so mean SD is not zero. However, the means of
the equivalent tests are still identical, so this is not a good guide to criterion referenced reliability
here (the correlations between occasions are also high, showing good norm referenced
reliability).
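A minimal sketch of this calculation (Python with NumPy assumed), reproducing the SD column of Examples 1 and 2 above - the SDs shown there are the sample SDs (ddof=1):

import numpy as np

example1 = np.array([[200, 200, 200],
                     [400, 400, 400],
                     [600, 600, 600]])

example2 = np.array([[200, 200, 200],
                     [500, 600, 700],
                     [500, 400, 300]])

for name, data in [("Example 1", example1), ("Example 2", example2)]:
    case_sds = data.std(axis=1, ddof=1)     # SD of each subject's three scores
    print(name, "case SDs:", np.round(case_sds, 1),
          "mean SD:", round(float(case_sds.mean()), 1))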
5. Two or more occasions or judges: data in ordered categories.
Skipped here
6. Two or more occasions or judges: ordered/unordered categories with some absolute
value.
Main stats: proportion or percentage agreement, Cohen's kappa, Jaccard coefficient
6b. Two judges (or occasions): category data with absolute value.
When cases are put into categories by two judges, percentage agreement is the simple measure of
agreement widely used. That is, out of all the cases/people that the judges categorised, what
percent did they both put in the same category? However, the argument is often produced that if
the number of categories is small, two judges would produce some apparent agreement even if
they were randomly putting cases in the two categories, and one needs to adjust for that.
Example: two examiners look at written essays by 20 people and make a judgment on each
person either that the person shows adequate mastery of academic English for university study or
not. If they each in fact made no rational judgment but just randomly placed each student in the
mastery or non-mastery category, how much agreement would we get? Answer, there are 4
possible outcomes for each person judged: both judges say yes, both say no, the first says yes the
second no, the first says no and the second yes. Each of those is equally probable, so in principle
there would be agreement on 10 of the students, i.e. half, on that basis.
Some experts therefore like Cohen's kappa, which claims to adjust for random agreement.
Personally I still like the % agreement, for three reasons.
a. When talking about reliability, we are concerned with judges (examiners, researchers,
raters etc.) who are NOT likely to be actually placing people randomly into categories,
but trying their best to do the job rationally. Hence do we need an adjustment assuming
that they are doing this to some extent? If it was students, children or other people with
no expertise in the categorising being done who were doing it, then we might take a
different view. I.e. such people might be blindly guessing.
b. The more categories are involved the smaller the chance agreement rate becomes
(assuming equal distribution of judgments over all the possible combinations).
Number of categories Chance agreement as % of cases judged
2 50
3 33.3
4 25
5 20
6 16.7
7 14.3
8 12.5
9 11.1
10 10
In language research often the categorisation is, say, of chunks of interview transcript
(the cases) into a qualitative coding system the researcher has developed, which may
contain 30 or more categories. The amount of chance agreement here is negligible (3.3%).
c. There have been many criticisms made about Cohen's kappa from a technical point of
view. For instance the amount of agreement it shows varies not just depending on how
many agreements there are, but on the precise proportions of different types of
agreements and disagreements. Do we really want that? Thus if 2 judges put 20 people
into two categories A and B as follows, kappa comes out as 0 (as we might expect
because the agreements are not better than the chance rate).
First judge
A B
Second judge A 5 5
B 5 5
But if the disagreements are spread as we see here, kappa = .2, even though the number
of agreements on the diagonal is the same.
First judge
A B
Second judge A 5 0
B 10 5
And if the judgments were distributed like this, kappa changes again to .083.
First judge
A B
Second judge A 9 0
B 10 1
Also, the basic kappa usually calculated (as above) does not have a possible maximum
value of 1 for most data, but less than 1, so it is hard to evaluate how great the agreement
really is that is being reported: if kappa comes out at .36, but we don't know what the
maximum is that it could be for our data, what do we make of it? (see further below on
how to remedy that).
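If you want to check the three kappa figures just quoted, here is a minimal sketch in Python (NumPy assumed) using the standard formula kappa = (observed agreement - chance agreement) / (1 - chance agreement), with each table entered as second judge in rows and first judge in columns.

import numpy as np

def cohen_kappa(table):
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_observed = np.trace(table) / n                                       # diagonal = agreements
    p_expected = (table.sum(axis=1) * table.sum(axis=0)).sum() / n ** 2    # chance agreement
    return (p_observed - p_expected) / (1 - p_expected)

tables = {
    "even 5/5/5/5 split": [[5, 5], [5, 5]],
    "one-sided disagreements": [[5, 0], [10, 5]],
    "skewed marginals": [[9, 0], [10, 1]],
}

for name, t in tables.items():
    print(name, "kappa =", round(cohen_kappa(t), 3))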
However, since it is popular, here is a bit on the kappa coefficient (Cohen's kappa).
Take this example. Two markers rated the same 19 students for speaking ability on the Common
European Framework of Reference scale which runs from A1 the lowest to C2 the highest ability
grade.
A Basic Speaker
A1 Breakthrough or beginner
A2 Waystage or elementary
B Independent Speaker
B1 Threshold or intermediate
B2 Vantage or upper intermediate
C Proficient Speaker
C1 Effective Operational Proficiency or advanced
C2 Mastery or proficiency
When listed as categories the data looks like this:
B2 B2
A2 A2
B2 B2
C2 C2
B1 B1
B2 B2
B1 B1
B1 B1
B1 B2
B2 B2
B2 C1
B2 B2
C1 C1
C2 C2
A2 B1
B1 B1
C1 C1
B2 B2
A2 A2
The percentage agreement we can calculate from that is (16/19) x 100 = 84.2.
Advocates of Cohen's kappa, however, argue that even if the judges/raters/scorers were randomly
assigning grades to students there would still be some agreement, and that needs to be discounted
from the agreement measure. That is what kappa claims to do.
When displayed as a contingency table (= SPSS Crosstabs), the data looks like this:
judge2 * judge1 Crosstabulation (Count)

                judge1
                A2   B1   B2   C1   C2   Total
judge2   A2      2    0    0    0    0      2
         B1      1    4    0    0    0      5
         B2      0    1    6    0    0      7
         C1      0    0    1    2    0      3
         C2      0    0    0    0    2      2
Total            3    5    7    2    2     19
Kappa is done by SPSS under Analyze.... Descriptive statistics... Crosstabs... Statistics. SPSS
gives the value of unweighted kappa and the significance value but not certain other kappas
which are more useful. For those use the website http://faculty.vassar.edu/lowry/kappa.html
Instructions...
1. Generate the Crosstabs table in SPSS (and get kappa and its sig if you like)
2. Go to the website above and Select number of categories and choose 5 (since this data
involves placing cases into 5 categories A2 B1 B2 C1 C2)
3. In the yellow spaces in the grid just enter the raw numbers exactly as in the SPSS
crosstabs table. Think of category 1 as A2, category 2 as B1 etc.
4. Click Calculate and look at the following output:
Unweighted kappa observed (that should be the same as what SPSS produced and in published
work is what is usually quoted as kappa)
Unweighted kappa as proportion of maximum possible (that has the same p value as the
preceding)
Linear Weighted kappa observed
Linear Weighted kappa as proportion of maximum possible
Unfortunately there are various pros and cons one could debate as to which of these 4 is the best
figure to use as a measure of interjudge agreement. And none of them get rid of some of the
objections mentioned above. Whether you use kappa or not rather depends on whether you might have an examiner who would demand it! Here is my judgment:
The two observed kappas do not usually have 1 as their potential max value, so they are rather
difficult to interpret and compare. Hence I would not use them. In reliability work one wants to
know how big the agreement is, and wants it to be high. One is less concerned with significance,
since it would be remarkable if two markers did not agree enough to be significant, even if their
agreement was far from perfect. So we look more at the descriptive measure of agreement (such
as we did with Pearson r etc. earlier). Hence a descriptive statistic that does not have a known
maximum possible value is not much use, as we cannot tell how high it really is. The kappa-as-proportion-of-maximum-possible figures make more sense, as they do have a max of 1, more like correlation
coefficients. So I would use one of those, but make it clear in the writeup that that is what is
being cited.
If your data involves cases placed into just two categories (e.g. students judged by more than one
examiner as pass or fail for something), or more than two but unordered nominal ones (e.g.
words judged by more than one expert to be general English versus academic English versus
technical terms), then only look at the unweighted figures.
If the categories are logically ordered, as the CEFR grades are, then the weighted kappa is
probably preferable. It differs from unweighted in that it gives some credit to scorers where they
differ in only one scale category rather than two. I.e. if both judges say a case is B2, then perfect
agreement is recorded, if one judge says a case is B1 and the other says B2 they get recorded as
partially agreeing while if one says B1 and the other says C1 they get a lower credit for
agreement, and so on. Unweighted kappa only gives credit to perfect agreement.
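If you would rather compute these locally than via the website, here is a minimal sketch in Python (NumPy assumed) of unweighted and linear weighted kappa applied to the CEFR crosstab above (second judge in rows, first judge in columns, categories ordered A2 B1 B2 C1 C2). The proportion-of-maximum versions produced by the website are not reproduced here.

import numpy as np

# The crosstab from above: judge2 rows x judge1 columns, categories A2 B1 B2 C1 C2.
table = np.array([[2, 0, 0, 0, 0],
                  [1, 4, 0, 0, 0],
                  [0, 1, 6, 0, 0],
                  [0, 0, 1, 2, 0],
                  [0, 0, 0, 0, 2]], dtype=float)

n = table.sum()
observed = table / n
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n ** 2

# Unweighted kappa: every disagreement counts the same.
p_o, p_e = np.trace(observed), np.trace(expected)
kappa = (p_o - p_e) / (1 - p_e)

# Linear weighted kappa: a one-category miss (e.g. B1 vs B2) is penalised
# less than a two-category miss (e.g. B1 vs C1), and so on.
k = table.shape[0]
distance = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
kappa_w = 1 - (distance * observed).sum() / (distance * expected).sum()

print("unweighted kappa =", round(kappa, 3))
print("linear weighted kappa =", round(kappa_w, 3))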
6d. Two judges (or occasions): category data with absolute value where the categories
themselves are not given in advance.
When researchers categorise data from open response sources like think aloud protocols,
interviews or observation, they first have to adopt someone's classification scheme or develop
what they think is a workable and valid system of classification of their own. This revising and
developing of the set of categories to use requires work with other experts (e.g. the PhD student
gets the supervisor to do some independently and they discuss): this is more a matter of content
validation than reliability. Thus reading strategy researchers need to work out a suitable set of
different strategies, with a definition of each, which they look for evidence of their subjects using:
this could come from the lists of other researchers, or in part from their own data. The motivation
researcher, having interviewed learners and asked their reasons for learning a FL, has to establish
a set of distinct types of reason that they mention, again using the lists of other researchers (e.g.
Gardner) and/or relying on analysis of what their own subjects actually suggested (maybe using
qualitative analysis software to help). The observer of classes, looking for how teachers
introduce new vocab, may have a checklist of types of ways that can be done prepared in
advance to tick off, but may modify this if the teacher uses some technique not on the list.
Now once the data has been gathered and gone over in order to finalise the classification system
itself, then the researcher applies the final version of the classification system to all the data.
Then of course he/she can count up how many times each subject used this or that strategy, how
many learners mentioned 'for my job' as a reason for language learning, how many teachers used
translation to present the meaning of new vocab, etc. etc. At this point it is common good
practice to get another person or people, trained in the classification system, to go over a sample
of the same data and see if they classify it the same way as the researcher. This is the reliability
checking part of the data analysis. It allows you to show that different analysts can use the same
set of categories consistently, and mostly agree on which bits of data fall in which category.
Hopefully there will not be too many instances where one analyst thinks what a reader said is
evidence of them using the paraphrase strategy while another thinks it is an instance of the
summarising strategy. Obviously you need a measure of agreement to report: some form of
proportion or percentage of instances agreed on by both judges. In principle this is, as a
proportion,
(Number of cases placed in the same category by both judges) / (Total number of cases)
(where cases are often not people but bits of data from within people, e.g. instances of strategy
use). For percentage agreement multiply by 100.
A problem that arises for measuring agreement here is that the set of things to be categorised is
not determined in advance, but somewhat fluid and decided by each judge for him/herself. This
may arise for instance in categorising strategies from 'oral protocols'. Suppose the same tape of
someone doing a think aloud task while reading is given to two judges to identify and categorise
all the reading strategies identifiable from it. Not only may they disagree on how to categorise
particular chunks of what is said, but also on whether a particular bit of talk gives evidence of a
strategy being used at all. So it is not so straightforwardly 'one set of cases classified twice' as in
simple reliability situations.
Imaginary example. Six people/strategies or whatever classified as either A or B on two
occasions/by two judges, with some missed by one or other judge.
occ./judge 1 occ./judge 2
case 1 A --
case 2 A A
case 3 A B
case 4 -- A
case 5 B B
case 6 A --
(This could be the same cases categorised on two occasions or the same ones categorised by two
judges). There are several ways out:
a. You take one occasion/judge as the gold-standard and disregard any cases that were
categorised by the other one only. E.g. if you take occasion/judge 1 above as the standard, you
omit case 4 when calculating percentage agreement etc., and just calculate it out of 5 cases. This
however gives precedence to one occasion/judge, when neither is normally to be taken as prior to
the other for reliability purposes. If you do it like this, usually the researcher is the judge taken as
having priority. This is the solution I have seen most often used.
b. You only consider cases common to both. OK, that is even-handed, but may leave a good deal out
(cases 1, 4 and 6 above). Still this might seem best if the cases have left themselves out, as it
were - e.g. people who did not turn up both times for a test-retest. SPSS would calculate
agreement this way, leaving out anyone with missing data in any column. But SPSS is not very
good at this sort of thing, and it is best just to calculate these agreements by hand.
c. You include all, and have an extra category of 'missing' or 'not recognised' for the problem
cases. That might seem fairest for the strategy classification example, as there is then some
record of how far the judges failed even to agree on where there was a strategy at all, let alone
how to classify it. This would yield a contingency table for the above like this:
Judge 1
A B --
Judge 2
A 1 0 1
B 1 1 0
-- 2 0 ??!
However, a problem does arise for the cell marked with a ? above. In principle this cannot be
completed as there is really no way of counting the instances where both judges agreed on not
recognising the existence of a strategy at all in a segment of transcript. This would therefore be a
good candidate for the use of the Jaccard coefficient which is calculated like the proportion or
percentage agreement, but using eight of the cells in the table, not all nine: it omits consideration
of the bottom right cell, the number of cases agreed on as not being whatever it is.
In this toy data, what is the % agreement for method a and b, and the Jaccard coefficient
for c?
Which method makes the reliability look best? Which worst?
Yet another solution for c here is to calculate two reliability coefficients (using % agreement or
kappa), not try and treat the problem as a single one:
One would be for the agreement in recognising occurrence/not of a strategy at all (regardless of
the classification of what kind), where n is all the units the protocols are divided up into
(utterances, sentences or whatever). We see the extent to which the judges agree that each
sentence does/doesn't show evidence of a strategy of some sort being used (2 categories).
The other would be calculated just within the set of utterances where both judges agreed there
was a strategy of some sort (option b above). It would focus on how far they agreed in the
specific classification of each strategy as 'prediction', 'planning', etc. (commonly such
classifications contain over 20 strategies).
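To check your hand calculations on the toy data, here is a minimal sketch in Python (no extra libraries needed) that works out percentage agreement under methods a and b, and the Jaccard-style coefficient for c, following the definitions given above (None stands for a case that a judge did not recognise at all).

# The six-case toy data: None = the judge did not recognise that case at all.
judge1 = ["A", "A", "A", None, "B", "A"]
judge2 = [None, "A", "B", "A", "B", None]

pairs = list(zip(judge1, judge2))

# (a) Judge 1 taken as the standard: drop cases judge 1 did not categorise.
a_pairs = [(j1, j2) for j1, j2 in pairs if j1 is not None]
a_agree = sum(j1 == j2 for j1, j2 in a_pairs) / len(a_pairs)

# (b) Only cases categorised by both judges.
b_pairs = [(j1, j2) for j1, j2 in pairs if j1 is not None and j2 is not None]
b_agree = sum(j1 == j2 for j1, j2 in b_pairs) / len(b_pairs)

# (c) Jaccard-style: agreements on an actual category, out of all cases listed
#     (the 'both missed it' cell cannot be counted, so it never enters).
c_agree = sum(j1 == j2 and j1 is not None for j1, j2 in pairs) / len(pairs)

print(f"method a: {a_agree:.0%}   method b: {b_agree:.0%}   Jaccard: {c_agree:.2f}")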
6e. Three or more occasions or judges: category data with absolute value.
For any sort of absolute categorisations of the same cases done separately by more than two
judges, of course you can simply do the procedures already described for each pair of judges
separately (or see Dunn, 1989: 7.4 or Rietveld and van Hout p221ff for fuller procedure). The
average of the agreements between all pairs of judges is a simple overall measure of reliability.
You can also calculate the overall proportion agreement among 3+ judges as:
(Number of cases placed in the same category by all judges) / (Total number of cases)
Here, as usual, +1 would indicate total agreement/perfect reliability. However, this is somewhat
crude: if we do this for four judges putting some learners in pass/fail categories for adequacy in
apologising appropriately in English, it does not take into account, where all four judges do not
agree, how often the split was three-one or two-two. Yet clearly we would feel this adds
something to our impression of reliability. In addition to a high proportion of unanimous
judgements, we would be more convinced of the inter-judge reliability if most of the others were
three-one rather than two-two. More valuable therefore is to calculate and report a series of
proportions (or percentages): the proportion of cases agreed on by all judges, by all but one, by
all but two etc., as appropriate for the number of judges and categories involved.
Equally, for case analysis with four judges, the people/things that were disagreed about 2-2 are
obviously more 'odd' than those that were disagreed about 3-1 or agreed about (4-0).
A table that can be useful in the multijudge situation is one with scale categories across the top
instead of occasions/judges. By this I mean as follows.
Normally the data is in the form (from which SPSS can get contingency tables for each pair of
judges):
Columns = judges 1, 2, 3 etc.
Rows = cases measured
Cell entries = numbers representing the category each judge placed each case in.
But to see the agreement of many judges this can be easier to interpret by eye, especially where
the number of judges is greater than the number of categories:
Columns = categories 1, 2, 3 etc. of the nominal scale
Rows = cases measured
Cell entries = number of judges who put each case in each category.
From whichever you use, you need to summarise, in a new column on the right,
what the agreement is for each case (e.g. if there are three judges and three categories, then it can
logically be either 3-0 or 2-1 or 1-1-1). Then you count the numbers of each type of agreement
and express it as % of all the cases. This is probably simplest done by hand from the columns of
raw figures. I have not found a simple way for SPSS to do it.
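Here is one way of doing it outside SPSS: a minimal sketch in Python (standard library only; the judgments are made up) that tallies the agreement pattern for each case and then summarises the patterns as percentages of all cases.

from collections import Counter

# Made-up data: one row per case, one entry per judge.
judgments = [
    ["pass", "pass", "pass"],
    ["pass", "fail", "pass"],
    ["fail", "pass", "pass"],
    ["pass", "fail", "fail"],
    ["fail", "fail", "fail"],
    ["pass", "pass", "fail"],
]

patterns = []
for case in judgments:
    counts = sorted(Counter(case).values(), reverse=True)    # e.g. [2, 1]
    patterns.append("-".join(str(c) for c in counts))        # '3' = unanimous, '2-1' = split

summary = Counter(patterns)
total = len(judgments)
for pattern, freq in summary.most_common():
    print(f"{pattern} agreement: {freq} cases ({100 * freq / total:.0f}% of cases)")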
6f. The problems described in 6d can also arise again, where one gets three or more people
(including oneself, perhaps) to categorise strategies from transcriptions of taped material.
Real example for consideration (design A). A student got a number of Arabic
speaking learners of English to write compositions both in English and Arabic, and
report in 'think aloud' fashion on what they were doing as they wrote. The taped
think aloud protocols were then each gone through by the researcher who identified
distinct bits of writing behaviour/strategies and categorised them into the major and
minor categories of Perl's system and counted up frequencies of use of the various
behaviours. Perl's system has 23 'major/main categories', with labels such as
'general planning', 'rehearsing', 'scanning back over text so far', 'revising'. There is a
rather larger number of more detailed categories called 'minor/subscript categories'.
To check on reliability, a sample of the tape transcripts (protocols) was given to
two other judges briefed by the researcher to categorise in the same system. The
reliability is reported as follows for one of these protocols (i.e. the think aloud
material associated with one piece of writing by one subject). Look at the
information and answer these questions (Can't use SPSS here):
Is the researcher looking at the reliability overall or in pairs for the three
judges? Are all pairs considered?
Which measure of the ones discussed in sec. 6 is being used as a reliability
coefficient?
Is n the same for all judges? I.e. did they all identify the same number of
strategies, and the issue is simply how far they categorised them the same
way? Or not?
Which of the methods a-c in sec. 6d is being followed here? Do you think it
is a reasonable way of proceeding?
If we try to construct a contingency table to show the details of agreement in
categorisation between, say, the researcher and coder A, what problems do
we find? What information is missing in this account?
Do you think the missing information could tell us anything useful?
Extract from L's thesis draft:
RELIABILITY AND ITEM ANALYSIS IN SITUATION D
So far we have looked at simple ways of statistically assessing reliability in situations A-C
outlined at the start - where the same cases are remeasured in the same way on different
occasions or by different judges or by the same judge repeatedly. That enables one to see if the
markers/scorers are being consistent or if what seem like the same innocuous circumstances in
which data is gathered actually have varying effect on scores etc. (See my book...).
However, much reliability work focusses on the internal reliability of measuring instruments.
Wherever a multi-item test, attitude inventory etc. has been used to measure cases, we may well
want to examine this internal reliability. Indeed we can only look at internal reliability where the
measuring technique consists of a series of mini-measures added up to produce an overall score
for something. A reliable test etc. of this sort is then one where all the individual items are
supposed to be consistent and measure the same thing. But internal reliability is only relevant to
a set of items that are all scored on the same scale and are supposed to be measuring 'the same
thing' in the same conditions, not just any old set of items.
If you do a survey with a questionnaire asking questions about people's age, gender, level
of language ability, preference for using a bilingual or monolingual dictionary, etc., is this
a suitable instrument for internal reliability checking?
You do a psycholinguistic experiment with three sets of stimuli representing three
different conditions. E.g. you present native speakers of French with verbs from three
conjugations in French, mixed in with made-up words as distractors, and ask them to press
a key as fast as possible if the word they see is a real word of French. There are ten words
from each conjugation: can you usefully analyse the internal reliability of the response
times in each set?
In your study in Pakistan you measure integrative motivation with a set of agree/disagree
items which Gardner uses in his famous studies in Canada etc. Since this is a 'standard'
set of items is there any point in you assessing the internal reliability of that set in your
study?
Note the instrument only has to be used once on a group of suitable cases to allow internal
reliability to be checked. This alone makes checking this sort of reliability more popular among
researchers! The repetition of measurement that was a feature of designs A-C is still there though.
Each item in the test or inventory is conceived as being a remeasure of whatever the test as a
whole is testing in the same people in the same conditions.
Though most associated with pedagogical testing, internal reliability and item analysis apply to
multi-item instruments in many areas of language research, esp. psycholinguistics and applied
linguistics where tests are used. A restriction, however, is that the same items have to have been
used with one set of subjects. In some repeated measures experimental designs in
psycholinguistics, for example, there will have been a randomisation of items over people or
conditions, or use of a Latin square to assign items to conditions and people. The consequence
will be that while all items appear an equal number of times for all subjects and in all conditions,
within one condition (which is the domain within which one would assess reliability) there may
be no set of items for all of which one has scores from the same set of people. For example, some
subjects will have experienced item 1, 3 and 5 in that condition (and 2, 4, 6 in another condition),
others item 2, 4, 6, others item 1, 2, 3, and so on. In such a situation one can examine facility
(below) and distraction (if relevant), but not obtain classic alpha or Rasch reliability measures.
This sort of internal reliability analysis naturally combines with 'item analysis', which is a cover
term for the analysis of the results for individual items in a test etc., usually as a step towards
improving them for the next time the test etc. is used. However, item analysis can also be based
on other considerations than reliability: see Validity.
These activities are often associated with professionals developing language tests, but are often
also sorely needed to improve those little tests, inventories etc. that PhD students use to get data for
their research projects... If you want to elicit evidence of subjects' ability to interpret pronoun
reference in a particular kind of relative clause question in English, say, a common approach is to
make up a set of items, test the subjects, calculate total scores for each subject, then get on to the
interesting bit such as differences between groups or levels in ability to interpret correctly,
compare with other relative clause types, etc. The more cautious, reliability-conscious researcher
would additionally check if the individual items are in fact 'pulling together' (internal reliability).
After all, you wouldn't be so convinced by the test total score as a measure of someone's 'relative
clause interpretation competence' if people seem to be scoring quite differently on some items
than on others... That might suggest there are some items in there measuring something else,
probably irrelevant to what you want.
The choice of statistics involved in all this depends somewhat on three aspects:
Is the test etc. relative/norm-referenced or absolute/criterion referenced?
Are the individual items scored dichotomously (=binary right/wrong, yes/no etc.) or on a
more elaborate interval scale (e.g. 0 for wrong, 1 for partly right, 2 for correct)?
Is the design of the test concerned with the grading of items (e.g. from easy to hard) or is
this not relevant/difficulty does not vary much?
Not all the combinations of these can be looked at here.
Incidentally, re. the last of those considerations: There is some excellent information on three
ways of constructing attitude and suchlike inventories/scales of items at the website
http://trochim.human.cornell.edu/kb/scaling.htm. Two (Thurstone and Likert) do not involve
graded items, the third (Guttman) does. However the distinction also applies to sets of language
test items.
7-9. One occasion analysed internally: data taken as interval or binary, with relative/NR
value.
7. Classic norm-referenced internal reliability
Main stats: Cronbach's alpha (alias Kuder-Richardson)
The simplest way of checking if a set of items is reliable is to split them in half, calculate two
scores for each person, one from each half, then use the Pearson r correlation coefficient as seen
above to quantify overall how well the total score of each case for one subset of items correlated
with his/her score for the other subset.
What would be a sensible way to split a set of items in half for this? The first ten items
then the second ten in a 20 item test?
Since the correlation comes from two halves of the one set of observations, it initially constitutes
a reliability coefficient for a test or whatever only half the length of the one you started with.
This has to be scaled up, using the Spearman-Brown formula, to give the coefficient for the full
set of observations, twice as long:
Reliability of full set of items = (2 x split-half reliability) / (1 + split-half reliability)
Further, if the average score and spread (variance) of scores are not the same in the two halves,
there may be an underestimate of reliability. SPSS will take care of most of this for you.
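As an illustration outside SPSS, here is a minimal sketch of the split-half calculation with the Spearman-Brown step-up. It assumes an odd/even (alternate items) split and uses a small invented grid of people by items.

import numpy as np

scores = np.array([        # 5 people x 6 items, scored 0/1 (invented)
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
])

half1 = scores[:, ::2].sum(axis=1)     # each person's total on items 1, 3, 5
half2 = scores[:, 1::2].sum(axis=1)    # each person's total on items 2, 4, 6

r_half = np.corrcoef(half1, half2)[0, 1]    # Pearson r between the two half-scores
full_rel = (2 * r_half) / (1 + r_half)      # Spearman-Brown step-up to full test length
print(f'split-half r = {r_half:.3f}, stepped-up reliability = {full_rel:.3f}')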
An alternative approach is to use the 'Guttman split-half estimate' (see Bachman, 1990: p175ff).
That does not need adjustment for length and is not affected by non-equivalence of halves.
However, since we have SPSS available, let's use the Rolls-Royce measure - Cronbach's alpha.
This is standardly cited these days in internal reliability study of norm-referenced tests and the
like with uniform sets of items. We already used it above in 4a for multiple judges/raters. This
alpha reliability coefficient can be thought of as the average of the correlation coefficients you
would get if you were to use the split-half method on all the possible divisions of the test items
into two sets, not just the one based on taking alternate items. It works for items each either
scored dichotomously (right/wrong, yes/no etc.) or rated (e.g. out of 4) or scored (e.g. response
times). On dichotomously scored sets of items it is equivalent to the 'Kuder-Richardson 20'
formula (KR20). For calculation of the latter see Allen and Davies (1977: p194ff), also Hatch
and Farhady (1982: p247ff). If items are of equal difficulty, a slightly simpler form of the latter
formula - 'Kuder-Richardson 21' - can be used, but alpha really subsumes all these variants.
It needs as input all cases' scores on all individual items. I.e. you need a data grid entered with a
row for each case measured and a column for each item in a test or whatever (assumed to be
measuring a unitary construct, i.e. one thing). If the items are dichotomous, then the grid will
contain 1s and 0s, for yes/correct/agree etc. versus no/wrong/disagree for each person on each
item. Otherwise the grid may contain other numbers.
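For those curious about what SPSS is doing, here is a minimal sketch of alpha computed directly from such a grid, using the standard formula alpha = k/(k-1) x (1 - sum of item variances / variance of total scores). The grid is invented, not one of the listening test files.

import numpy as np

def cronbach_alpha(grid):
    grid = np.asarray(grid, dtype=float)
    k = grid.shape[1]                          # number of items
    item_vars = grid.var(axis=0, ddof=1)       # variance of each item across people
    total_var = grid.sum(axis=1).var(ddof=1)   # variance of people's total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

grid = [        # 5 people x 4 dichotomous items (invented)
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(f'alpha = {cronbach_alpha(grid):.3f}')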
As examples I have three (ludicrously) small sets of data for you to analyse (in text
files, not .sav ones). They are from three little listening tests made up by a teacher
and used with just five learners on an advanced English course in the Dept (see
sheets attached at end). The files should be available on the C drive of the
computer.
- In my file listtf are the results for five true/false items testing comprehension after
hearing a passage on tape, each scored correct/not. The items are attached.
- In file listgap are the results for correct filling of ten gaps in a cloze version of the
same passage previously heard (attached), each scored correct/not.
- In file listdict are the results for a free dictation. Cases had to write down as good
a version as possible of what they heard (a different passage from the above), not
necessarily word for word. They were scored out of 3 for grammatical etc.
correctness of each of the five sentences of the passage.
In which of these tests are the items 'dichotomous' then?
Load the first of these (remember File...Read text data) and look at the Data sheet to
see what the data looks like.
Does anything strike you by eye?
Maybe it is things we will see later from the statistics. In a small data set eyeballing
is often just as good as stats, but on the larger scale it isn't possible.
To calculate alpha you then choose Analyze...Scale...Reliability analysis.... The
default is alpha though you can choose split half etc. if you want. Sticking with
alpha, highlight all your items and transfer them in one go into the right hand box
for treatment.
Click the Statistics button to get some things we will need below for Item
Analysis. What you will need is obtained by choosing Descriptives for Item, and
Scale if Item Deleted. In SPSS it doesn't pay to ask for everything. You tend to
end up unable to see the wood for the trees. Proceed.
You may get a message 'Warning zero variance items'.
What does that mean? What is zero variance?
What value of alpha do you get?
Alpha, like a correlation coefficient, has a maximum of 1 if reliability is perfect.
Anything below .8 would generally be considered rather poor in reliability studies.
If your alpha is less than .8, why do you think that is?
Load the other files and do the same. But I suggest you save each one as a SPSS
file (with suffix .sav) as you go along so you can easily get it back to look at it
again. (Remember: File...Save Data...).
Which of the three listening tests is the most reliable?
Why do you think that is? Hint: remember alpha is a measure related to the Pearson
r correlation coefficient - what things did we find earlier helped increase Pearson
rs?
A problem which all the above approaches suffer from is that, to give an unbiassed coefficient of
reliability, it has to be assumed that any person's response to any given item in a multi-item test
or inventory is uninfluenced by their response to any other - i.e. that there is 'independence' of
every item from every other one. This may not always be so, and is hard to check anyway. The
consequence of violation of independence tends to be an overestimate of reliability. However,
there is often little one can do about this so it is usually glossed over.
Which of the above tests do you think is more likely to lack this sort of
independence of response to different items?
How could you change the test to counter this?
In a psycholinguistic experiment where each condition is represented by a test-like set of items,
this independence is more certain if the sets of items are not given separately, but all mixed up
with each other in one list administered together.
8. Classic norm-referenced item analysis, no grading of items at issue
Main stats: mean score per item, item-total correlation, alpha if item deleted, chi squared fit test.
Traditional NR item analysis looks mainly at two aspects of items - their difficulty/facility and
their discrimination/discriminability.
8a. Facility. A norm-referenced measure is designed to show clearly which cases score higher
than which others, not the absolute level of knowledge of anything that each case has. So which
of the following items are 'good' in such a test, which 'bad'?
Items which most/all people get right?
Items which most/all people get wrong?
Items which about half the cases get right, half wrong?
If you are using a multi-item test or inventory as a research tool, e.g. to gather data on
memory for content of passages on different topics, with a hypothesis that where the
reader knows more about the topic in advance they remember more, would you want a
relative or absolute measure of how much of the content is retained?
You are wanting to compare native speaker and non-native speaker teachers of English
for their focus on correctness. You give a few of them numerous sentences with possible
errors in them to mark in your pilot of your elicitation instrument. Some items are marked
wrong by all of them. Do you leave them out in the revised instrument?
How do the concepts of 'ceiling effect' and 'floor effect' relate to this matter?
Can you see how facility relates to NR reliability? An item which everyone gets wrong
has a standard deviation of...what? If all items in a test were got mostly right or mostly
wrong, the variation in total test scores would be...high or low? And do correlation
coefficients of the Pearson r type come out higher or lower if there are higher degrees of
variation among the scores being correlated?
8b. Discrimination. With the idea of internal reliability as described, a good item is obviously
one that pulls together with the others and so 'fits' the overall pattern of responses. In short, a
good item is one that the people who did best on the whole test got right and that the worst ones
got wrong - it discriminates cases in the same way as the test as a whole. An item got right by a
lot of people who were the worst overall is not discriminating helpfully. This can be measured by
the correlation coefficient of each case's score for the individual item with their score for the
whole test. (There are older by-hand methods for this involving comparing items' popularity with
the best third and the bottom third of testees/cases, not gone over here).
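Here is a minimal sketch of facility and discrimination computed directly from a person-by-item grid of 0/1 scores; the grid is invented. The item-total correlation below is the 'corrected' version (the item's own score is left out of the total), which is essentially what SPSS reports in its item-total column.

import numpy as np

grid = np.array([      # 5 people x 4 dichotomous items (invented)
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
])

totals = grid.sum(axis=1)
for i in range(grid.shape[1]):
    facility = grid[:, i].mean()          # for a 0/1 item this is the proportion correct
    rest = totals - grid[:, i]            # each person's total with this item left out
    if grid[:, i].std() == 0 or rest.std() == 0:
        discrimination = float('nan')     # a zero-variance item cannot correlate with anything
    else:
        discrimination = np.corrcoef(grid[:, i], rest)[0, 1]
    print(f'item {i + 1}: facility = {facility:.2f}, item-total r = {discrimination:.2f}')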
The output of the SPSS procedure you have already generated gives you all the
facility and discrimination information you need.
For facility look at the item means - i.e. the average score calculated for the item
rather than for a person taking the test.
If the items are scored 0 or 1, then 'good' means in reliability terms are near...
what value?
What are undesirable means - indicating items that might usefully be left
out?
If an item scored 0/1 has a mean of .7, what % of the testees got it right?
In the test with items scored out of three, what is the ideal mean for an item?
For discrimination, look at the item-total correlations. These are based on each
person's score for the item and their score for the whole test. If the item is
discriminating well, then its Pearson r correlation with the total test will be high.
Why do some of them come out as 0?
A third useful bit of information for each item is the alpha if item deleted column.
This reflects both facility and discrimination and tells you simply what the
reliability of the whole set of items would be if that item were left out.
To improve a set of items obviously you remove or replace items that
would ?increase or decrease? alpha if left out - which?
Generally items with means near the extremes and with poor item-total correlations
will increase alpha by being left out.
Can you see which items would best be replaced? Why?
Do all this for all three listening tests and pick some items that are unsatisfactory in
each.
Can you see from the actual items why they are unsuitable?
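To make the 'alpha if item deleted' column concrete, here is a minimal sketch of the same idea outside SPSS, repeating the alpha formula from the earlier sketch; the grid is again invented.

import numpy as np

def cronbach_alpha(grid):
    k = grid.shape[1]
    return (k / (k - 1)) * (1 - grid.var(axis=0, ddof=1).sum()
                            / grid.sum(axis=1).var(ddof=1))

grid = np.array([      # 5 people x 4 dichotomous items (invented)
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

print(f'alpha with all items: {cronbach_alpha(grid):.3f}')
for i in range(grid.shape[1]):
    reduced = np.delete(grid, i, axis=1)    # drop item i
    print(f'alpha if item {i + 1} deleted: {cronbach_alpha(reduced):.3f}')
# Items whose deletion raises alpha are the obvious candidates for replacement.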
8c. Distraction. Finally on item analysis, it is worth mentioning something further you can do
exclusively where the items are multiple choice. A basic rule of multiple choice tests (but far
from the only one) is that the 'distractors' should genuinely 'distract'. If there are four alternatives
but two are so ludicrously wrong that nobody would choose them, a four choice item in effect
becomes a two choice one - much easier to guess the answer from, with consequent unreliability.
To check on this, you simply have to have available information not just on who got each item
right or wrong, but also, if they got it wrong, on which distractor they chose.
Below are some hand counted figures for a few items related to prepositions in a
large scale piece of test piloting by Mukattash in Jordan some years ago.
Look at the descriptive statistics for the distractors (the frequencies) and answer
these questions for each item:
Which is the least popular distractor?
Does it differ in different types of school (i.e. for different populations of
testees)?
Can you tell why it is the least popular?
Of course distractors are never going to be chosen exactly equally, so you might
like a way of deciding where they are being chosen so differently that it is worth
revising them.
On your judgment, in which of these examples is the least popular distractor
so unpopular that you would change it in a revised test?
A statistical means for deciding the issue would involve the chi-squared goodness of fit test
(which is basically the one variable version of chi squared, used where frequencies in three or
more unordered categories need to be checked against 'expected' frequencies). This version of
chi squared would test the hypothesis that there is no difference between the observed
proportions of responses falling in the three distractor categories, and the proportions you would
expect if equal numbers fell in each (the 'expected' frequencies), beyond what you would expect
from the vagaries of sampling.
If that hypothesis is rejected, then ... do you change the distractors or not?
In item 9, in the vocational school, how many responses would you expect
each of the three distractors to attract if they were equally attractive?
To do the statistical test, enter the figures by the short method (cf LG475 tasks). You need a
column to label the three categories (distractors) and another with the three actual frequencies.
Then get SPSS to weight the data by the column with the frequencies. Then choose
Analyze...Non-parametric tests...Chi Square. Choose the category column as the test variable.
Keep everything else as given.
In the result, the Expected column tells you the frequencies under the null hypothesis: if all
distractors were equally distracting. The Residual is the Observed frequency minus the Expected
one.
The least successful distractor will have a high ?negative or positive? residual - which?
If chi squared is significant, that is a reflection of high positive and negative residuals, i.e.
responses to distractors so uneven as to merit serious consideration of changing at least one of
them. I.e. it is unlikely the observed proportions could have arisen by chance from a population
where the division of choice was actually equal.
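Outside SPSS, the same goodness-of-fit test can be run with scipy; the distractor frequencies below are invented, not Mukattash's actual figures.

from scipy.stats import chisquare

observed = [34, 28, 8]            # invented counts of testees choosing each of three distractors
result = chisquare(observed)      # expected frequencies default to an equal split
print(f'chi squared = {result.statistic:.2f}, p = {result.pvalue:.4f}')
# A significant result means the distractors are not being chosen anything like equally,
# which points towards revising the least popular one(s).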
Is chi squared significant for distractors of Item 9, vocational school data?
Try this for some other items/schools.
For different schools would you come to different decisions about suitability
of the same item?
9. Item analysis where grading of items is involved: binary/dichotomous data
Main stats: Guttman's scalability coefficient, Rasch t.
Classic norm-referenced reliability and IA techniques identify and help one develop multi-item
instruments that distinguish between cases very consistently and finely where the cases are of a
similar ability. They don't tell you how much of anything anyone knows in any absolute sense -
you need a criterion-referenced measure for that. And they don't work well on a wide band of
ability - you need instruments developed to have the properties that Rasch analysis focuses on
for that. You can of course make a whole series of norm-referenced tests for different levels.
However, there is no easy way to relate the score of a person on one such test to their score on
another, because the tests would have to be compiled and refined separately for each population.
In testing circles there is a lot of interest nowadays in tests which contain items in a whole range
of difficulties (though not necessarily presented to test-takers in order of increasing difficulty), so
that, for example, one test will do for beginner and intermediate learners or for children of a wide
span of ages or for dialect speakers ranging from very broad to practically standard. A related
idea is that you can establish an 'item bank' from which you draw items to make up tests suited to
particular levels of case. But these items are all related to each other in such a way that you can
relate someone's score on one such test to someone else's score on another.
Compared with the classic approach above, items are chosen and checked not necessarily to be
of middling difficulty for everyone - there is a good range of difficulty - but discriminability of a
sort is retained. This is done by requiring items to form an 'implicational scale'. That means that a
set of test items of this sort will range from hard to easy, but in such a way that people who get
harder ones right also get right those items that are generally easier (i.e. which more people get
right). That is a third reliability-type concept, alongside the classical NR high correlation one
and the absolute CR high agreement one.
Full story skipped on this occasion. See me if you need it!
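For anyone who wants a taste of the skipped story anyway, here is a minimal sketch of one common way of checking implicational scaling: order the items from easiest to hardest, predict each person's responses from their total score (1s on the easiest items only), and count the mismatches ('Guttman errors'), turning them into a coefficient of reproducibility of the usual 1 - errors/responses form. The data grid and the exact error-counting convention are illustrative assumptions, not a full Guttman or Rasch analysis.

import numpy as np

grid = np.array([      # people x items, 0/1 (invented)
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
])

order = np.argsort(-grid.mean(axis=0), kind='stable')   # columns reordered, easiest item first
ordered = grid[:, order]

errors = 0
for row in ordered:
    k = int(row.sum())
    ideal = np.array([1] * k + [0] * (len(row) - k))    # perfect implicational pattern for this score
    errors += int((row != ideal).sum())

rep = 1 - errors / grid.size
print(f'Guttman errors = {errors}, coefficient of reproducibility = {rep:.3f}')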
10. One occasion analysed internally: data taken as interval or binary, and absolute/CR.
Classical reliability and IA is not criterion-referenced and the Rasch approach, though it claims
to be, is not straightforwardly so. There are various sophisticated ideas about how to improve
internal reliability of CR tests not covered here (e.g. see Subkoviak and Baker (1977) and
Bachman (1990: 212ff)). But one could use the approach of 4c above as a simple alternative.
Calculate an SD for each person across their responses, and take the mean of those SDs as an
indication of reliability. Low SD means more agreement. Identify bad items on the basis of
seeing which items, if omitted, lead to the best reduction in mean SD over all subjects (analogous
to alpha if item deleted). And identify subjects that are especially giving rise to disagreement
too.
My ABSREL2 programme does that for you.
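For readers without ABSREL2 to hand, here is a rough sketch of the procedure just described (it is not the ABSREL2 programme itself, and the scoring grid is invented).

import numpy as np

grid = np.array([      # people x items, all scored on the same absolute 0-3 scale (invented)
    [3, 3, 2, 3],
    [2, 2, 2, 1],
    [3, 2, 3, 0],
    [1, 1, 1, 1],
])

def mean_person_sd(g):
    return g.std(axis=1, ddof=1).mean()    # SD across items for each person, then averaged

print(f'mean per-person SD, all items: {mean_person_sd(grid):.3f}')
for i in range(grid.shape[1]):
    reduced = np.delete(grid, i, axis=1)   # drop item i
    print(f'mean per-person SD if item {i + 1} deleted: {mean_person_sd(reduced):.3f}')
# The item whose removal gives the biggest drop is the one most out of line; a per-person
# version of the same idea (SD per person, largest first) picks out disagreement-causing cases.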
CONCLUSION ON RELIABILITY AND RELATED IA
11. A further concern may be as follows. Ideally one does all the reliability checking in a pilot
study, and uses the information to revise everything for the main study. But suppose one has
done the relevant things from the above in one's main study. There is no possibility of revising
the instrument or procedure and going back and doing it again, so one has to just exploit the
figures one has got to the best possible extent in the analysis of the real RESULTS of one's
enquiry, with due acknowledgement of its possible failings. But what figures does one use?
Typical procedures used to 'massage' data:
In general.
Leave out cases that seem odd/don't fit. But make a good argument for them not belonging to
your targeted population on grounds other than just that their responses weren't right! E.g. you
could show that, prompted by their odd responses, you looked back at your notes from the data
gathering and can see that that person was rather unwilling to participate, or was a learner who
had learnt English under different conditions from the other subjects, etc..... Of course this option
is not available in pedagogical uses of tests where one is interested in the scores of all particular
students.
In design (D).
Leave out whole items that don't fit the pattern of the set they belong to. That is less
controversial. You may also be able to identify particular responses of particular cases that are
odd and omit just them: e.g. in response time tests it is common to leave out responses that are
overlong or overshort (a sketch of one common trimming convention follows the questions below).
What argument could you use to support that?
How could you define 'too long'?
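Here is a minimal sketch of one common trimming convention; the particular cut-offs (under 200 ms, over 2500 ms, then anything beyond 2.5 SDs of the person's remaining mean) are illustrative assumptions rather than recommendations from this handout, and the times are invented.

import numpy as np

rts = np.array([640.0, 702.0, 655.0, 3100.0, 598.0, 180.0, 725.0])   # ms, one person, one condition

# fixed cut-offs: treat anything under 200 ms or over 2500 ms as not a genuine response
kept = rts[(rts >= 200) & (rts <= 2500)]

# then trim anything more than 2.5 SDs from this person's mean of the remaining responses
mean, sd = kept.mean(), kept.std(ddof=1)
kept = kept[np.abs(kept - mean) <= 2.5 * sd]

print(f'{len(rts) - len(kept)} responses removed; mean RT of retained responses = {kept.mean():.0f} ms')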
For onward analysis of results you then have to calculate a score for each case/person that is their
average over the scores that you have decided to retain, which may be a different number for
different cases/people. This also arises in general if there are missing values/scores for other
reasons. If everything is in one row per case format in SPSS you can do this by using
Transform... Compute... and selecting MEAN(numexpr,numexpr,...) from the list of
functions. This calculates an average over the non-missing values in a set of columns. So if your
data was three cases on three items, with some missing responses thus:
Item1 Item2 Item3
Case1 1 0 0
Case2 1 . 0
Case3 . . 1
Fill in the formula in SPSS as MEAN(Item1,Item2,Item3) and the new column will come out
with averages using only the non-missing columns. Here that is .33, .5 and 1.
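The same computation can be done outside SPSS; here is a minimal sketch with pandas that reproduces the .33, .5 and 1 of the example above.

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Item1': [1, 1, np.nan],
    'Item2': [0, np.nan, np.nan],
    'Item3': [0, 0, 1],
}, index=['Case1', 'Case2', 'Case3'])      # '.' in the SPSS grid = missing, np.nan here

data['mean_score'] = data.mean(axis=1, skipna=True)   # average over non-missing items only
print(data)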
In design (A) and (B) and (C).
The usual wisdom is that the mean of several separate measurements is always more reliable
than one measurement of anything. But if two out of three judgements/occasions agree well with
each other, and differ from the third, maybe leave out the scores from the 'odd' judgement that
didn't agree well, and use the average of the other two. If different judges/scorers are involved,
then you can leave out an unreliable scorer and use the scores averaged over the others. Or train
up that scorer a bit more and get him/her to score the protocols again!
When combining figures from several judges/occasions etc. there is a special problem if the data
is not interval scores or rank orders (which can be averaged with the mean). With categorisations
you would typically use the category that the majority of judgements placed a case in (i.e. the
'modal' category): so it is useful to use uneven numbers of judges to make this easy! E.g. if two
out of three times when you recategorised your think aloud data you thought a certain strategy
was 'Prediction', that is the one you regard it as being for your actual analysis of results.
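A minimal sketch of taking the modal category across an odd number of judgements; the strategy labels are invented.

from collections import Counter

judgements = ['Prediction', 'Prediction', 'Monitoring']    # the same strategy coded three times
modal_category, votes = Counter(judgements).most_common(1)[0]
print(f'category used for analysis: {modal_category} ({votes} of {len(judgements)} judgements)')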
There is quite a bit about the potential causes of unreliability in my book. Here we have only
looked at how to check on it statistically.
PJS revised a little Jan 2012