But to see the agreement of many judges, the following layout can be easier to interpret by eye,
especially where the number of judges is greater than the number of categories:
Rows: the cases measured.
Columns: categories 1, 2, 3 etc. of the nominal scale.
Cells: figures representing the number of judges who put each case in each category.
Whichever layout you use, you need to summarise, in a new column on the right, what the
agreement is for each case (e.g. if there are three judges and three categories, then it can
logically be either 3-0 or 2-1 or 1-1-1). Then you count the numbers of each type of agreement
and express it as a % of all the cases. This is probably most simply done by hand from the
columns of raw figures. I have not found a simple way to get SPSS to do it.
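If the grid of raw figures is available in machine-readable form, though, the tallying can be scripted. Here is a minimal sketch in Python (not SPSS), with invented judgements for three judges and three categories; it labels each case's agreement pattern and reports the percentages:

from collections import Counter

# Each row: the category (1, 2 or 3) that each of three judges gave one case.
# Invented data, purely for illustration.
judgements = [
    (1, 1, 1),   # all three judges agree
    (2, 2, 3),   # two agree, one differs
    (1, 2, 3),   # all three differ
    (3, 3, 3),
    (1, 1, 2),
]

def agreement_pattern(row):
    # Sort the per-category counts, largest first: '3' (the 3-0 of the text),
    # '2-1', or '1-1-1'.
    counts = sorted(Counter(row).values(), reverse=True)
    return "-".join(str(c) for c in counts)

tally = Counter(agreement_pattern(row) for row in judgements)
for pattern, n in tally.most_common():
    print(f"{pattern}: {n} cases ({100 * n / len(judgements):.0f}% of all cases)")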
6f. The problems described in 6d can also arise again, where one gets three or more people
(including oneself, perhaps) to categorise strategies from transcriptions of taped material.
Real example for consideration (design A). A student got a number of Arabic
speaking learners of English to write compositions both in English and Arabic, and
report in 'think aloud' fashion on what they were doing as they wrote. The taped
think aloud protocols were then each gone through by the researcher who identified
distinct bits of writing behaviour/strategies and categorised them into the major and
minor categories of Perl's system and counted up frequencies of use of the various
behaviours. Perl's system has 23 'major/main categories', with labels such as
'general planning', 'rehearsing', 'scanning back over text so far', 'revising'. There is a
rather larger number of more detailed categories called 'minor/subscript categories'.
To check on reliability, a sample of the tape transcripts (protocols) was given to
two other judges briefed by the researcher to categorise in the same system. The
reliability is reported as follows for one of these protocols (i.e. the think aloud
material associated with one piece of writing by one subject). Look at the
information and answer these questions (Can't use SPSS here):
Is the researcher looking at the reliability overall or in pairs for the three
judges? Are all pairs considered?
Which measure of the ones discussed in sec. 6 is being used as a reliability
coefficient?
Is n the same for all judges? I.e. did they all identify the same number of
strategies, and the issue is simply how far they categorised them the same
way? Or not?
Which of the methods a-c in sec. 6d is being followed here? Do you think it
is a reasonable way of proceeding?
If we try to construct a contingency table to show the details of agreement in
categorisation between, say, the researcher and coder A, what problems do
we find? What information is missing in this account?
Do you think the missing information could tell us anything useful?
Extract from L's thesis draft:
RELIABILITY AND ITEM ANALYSIS IN SITUATION D
So far we have looked at simple ways of statistically assessing reliability in situations A-C
outlined at the start - where the same cases are remeasured in the same way on different
occasions or by different judges or by the same judge repeatedly. That enables one to see if the
markers/scorers are being consistent or if what seem like the same innocuous circumstances in
which data is gathered actually have varying effect on scores etc. (See my book...).
However, much reliability work focusses on the internal reliability of measuring instruments.
Wherever a multi-item test, attitude inventory etc. has been used to measure cases, we may well
want to examine this internal reliability. Indeed we can only look at internal reliability where the
measuring technique consists of a series of mini-measures added up to produce an overall score
for something. A reliable test etc. of this sort is then one where all the individual items are
supposed to be consistent and measure the same thing. But internal reliability is only relevant to
a set of items that are all scored on the same scale and are supposed to be measuring 'the same
thing' in the same conditions, not just any old set of items.
If you do a survey with a questionnaire asking questions about people's age, gender, level
of language ability, preference for using a bilingual or monolingual dictionary, etc., is this
a suitable instrument for internal reliability checking?
You do a psycholinguistic experiment with three sets of stimuli representing three
different conditions. E.g. you present native speakers of French with verbs from three
conjugations in French, mixed in with made-up words as distractors, and ask them to press a
key as fast as possible if the word they see is a real word of French. There are ten words
from each conjugation: can you usefully analyse the internal reliability of the response
times in each set?
In your study in Pakistan you measure integrative motivation with a set of agree/disagree
items which Gardner uses in his famous studies in Canada etc. Since this is a 'standard'
set of items is there any point in you assessing the internal reliability of that set in your
study?
Note the instrument only has to be used once on a group of suitable cases to allow internal
reliability to be checked. This alone makes checking this sort of reliability more popular among
researchers! The repetition of measurement that was a feature of designs A-C is still there though.
Each item in the test or inventory is conceived as being a remeasure of whatever the test as a
whole is testing in the same people in the same conditions.
Though most associated with pedagogical testing, internal reliability and item analysis apply to
multi-item instruments in many areas of language research, esp. psycholinguistics and applied
linguistics where tests are used. A restriction, however, is that the same items have to have been
used with one set of subjects. In some repeated measures experimental designs in
psycholinguistics, for example, there will have been a randomisation of items over people or
conditions, or use of a Latin square to assign items to conditions and people. The consequence
will be that while all items appear an equal number of times for all subjects and in all conditions,
within one condition (which is the domain within which one would assess reliability) there may
be no set of items for all of which one has scores from the same set of people. For example, some
subjects will have experienced items 1, 3 and 5 in that condition (and 2, 4 and 6 in another
condition), others items 2, 4 and 6, others items 1, 2 and 3, and so on. In such a situation one
can examine facility
(below) and distraction (if relevant), but not obtain classic alpha or Rasch reliability measures.
This sort of internal reliability analysis naturally combines with 'item analysis', which is a cover
term for the analysis of the results for individual items in a test etc., usually as a step towards
improving them for the next time the test etc. is used. However, item analysis can also be based
on other considerations than reliability: see Validity.
These activities are often associated with professionals developing language tests, but are
also sorely needed to improve those little tests, inventories etc. PhD students use to get data for
their research projects... If you want to elicit evidence of subjects' ability to interpret pronoun
reference in a particular kind of relative clause question in English, say, a common approach is to
make up a set of items, test the subjects, calculate total scores for each subject, then get on to the
interesting bit such as differences between groups or levels in ability to interpret correctly,
compare with other relative clause types, etc. The more cautious reliability conscious researcher
would additionally check if the individual items are in fact 'pulling together' (internal reliability).
After all, you wouldn't be so convinced by the test total score as a measure of someone's 'relative
clause interpretation competence' if people seem to be scoring quite differently on some items
than on others... That might suggest there are some items in there measuring something else,
probably irrelevant to what you want.
The choice of statistics involved in all this depends somewhat on three aspects:
Is the test etc. relative/norm-referenced or absolute/criterion referenced?
Are the individual items scored dichotomously (=binary right/wrong, yes/no etc.) or on a
more elaborate interval scale (e.g. 0 for wrong, 1 for partly right, 2 for correct)?
Is the design of the test concerned with the grading of items (e.g. from easy to hard) or is
this not relevant/difficulty does not vary much?
Not all the combinations of these can be looked at here.
Incidentally, re. the last of those considerations: There is some excellent information on three
ways of constructing attitude and suchlike inventories/scales of items at the website
http://trochim.human.cornell.edu/kb/scaling.htm. Two (Thurstone and Likert) do not involve
graded items, the third (Guttman) does. However the distinction also applies to sets of language
test items.
7-9. One occasion analysed internally: data taken as interval or binary, with relative/NR
value.
7. Classic norm-referenced internal reliability
Main stats: Cronbach's alpha (alias Kuder-Richardson)
The simplest way of checking if a set of items is reliable is to split them in half, calculate two
scores for each person, one from each half, then use the Pearson r correlation coefficient as seen
above to quantify overall how well the total score of each case for one subset of items correlates
with his/her score for the other subset.
What would be a sensible way to split a set of items in half for this? The first ten items
then the second ten in a 20 item test?
Since the correlation comes from two halves of the one set of observations, it initially constitutes
a reliability coefficient for a test or whatever only half the length of the one you started with.
This has to be scaled up, using the Spearman-Brown formula, to give the coefficient for the full
set of observations, twice as long:
Reliability of full set of items = (2 x split-half reliability) / (1 + split-half reliability)
Further, if the average score and spread (variance) of scores is not the same in the two halves,
there may be an underestimate of reliability. SPSS will take care of most of this for you.
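For anyone who would rather see the arithmetic than trust the menus, here is a minimal Python sketch of the whole split-half procedure, with invented half-test scores for six people (it needs Python 3.10+ for statistics.correlation):

from statistics import correlation  # Pearson r; Python 3.10+

# Invented total scores of six people on the odd-numbered items and the
# even-numbered items of one test (an alternate-items split).
odd_half = [4, 7, 5, 9, 6, 8]
even_half = [5, 6, 5, 8, 7, 9]

# Pearson r between the two halves = reliability of a HALF-length test.
split_half_rel = correlation(odd_half, even_half)

# Spearman-Brown: scale up to the reliability of the full-length test.
full_rel = (2 * split_half_rel) / (1 + split_half_rel)

print(f"split-half r = {split_half_rel:.3f}")
print(f"Spearman-Brown full-test reliability = {full_rel:.3f}")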
An alternative approach is to use the 'Guttman split-half estimate' (see Bachman, 1990: p175ff).
That does not need adjustment for length and is not affected by non-equivalence of halves.
However, since we have SPSS available, let's use the Rolls-Royce measure - Cronbach's alpha.
This is standardly cited these days in internal reliability study of norm-referenced tests and the
like with uniform sets of items. We already used it above in 4a for multiple judges/raters. This
alpha reliability coefficient can be thought of as the average of the correlation coefficients you
would get if you were to use the split-half method on all the possible divisions of the test items
into two sets, not just the one based on taking alternate items. It works for items each either
scored dichotomously (right/wrong, yes/no etc.) or rated (e.g. out of 4) or scored (e.g. response
times). On dichotomously scored sets of items it is equivalent to the 'Kuder-Richardson 20'
formula (KR20). For calculation of the latter see Allen and Davies (1977: p194ff), also Hatch
and Farhady (1982: p247ff). If items are of equal difficulty, a slightly simpler form of the latter
formula - 'Kuder-Richardson 21' - can be used, but alpha really subsumes all these variants.
It needs as input all cases' scores on all individual items. I.e. you need a data grid entered with a
row for each case measured and a column for each item in a test or whatever (assumed to be
measuring a unitary construct, i.e. one thing). If the items are dichotomous, then the grid will
contain 1s and 0s, for yes/correct/agree etc. versus no/wrong/disagree for each person on each
item. Otherwise the grid may contain other numbers.
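If you want to see what SPSS is doing under the bonnet, alpha is easy to compute directly from such a grid. A minimal Python sketch using the standard formula, alpha = (k/(k-1)) x (1 - sum of item variances / variance of total scores), with an invented grid of 1s and 0s (so here alpha is the same as KR20):

from statistics import variance  # sample variance, n-1 denominator

# Invented grid: one row per case, one column per item (1 = right, 0 = wrong).
grid = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]

k = len(grid[0])                                    # number of items
item_vars = [variance([row[i] for row in grid]) for i in range(k)]
total_var = variance([sum(row) for row in grid])    # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.3f}")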
As examples I have three (ludicrously) small sets of data for you to analyse (in text
files, not .sav ones). They are from three little listening tests made up by a teacher
and used with just five learners on an advanced English course in the Dept (see
sheets attached at end). The files should be available on the C drive of the
computer.
- In my file listtf are the results for five true/false items testing comprehension after
hearing a passage on tape, each scored correct/not. The items are attached.
- In file listgap are the results for correct filling of ten gaps in a cloze version of the
same passage previously heard (attached), each scored correct/not.
- In file listdict are the results for a free dictation. Cases had to write down as good
a version as possible of what they heard (a different passage from the above), not
necessarily word for word. They were scored out of 3 for grammatical etc.
correctness of each of the five sentences of the passage.
In which of these tests are the items 'dichotomous' then?
Load the first of these (remember File...Read Text Data) and look at the Data sheet to
see what the data looks like.
Does anything strike you by eye?
Maybe it is something we will see later from the statistics. In a small data set eyeballing
is often just as good as stats, but on a larger scale it isn't possible.
To calculate alpha you then choose Analyze...Scale...Reliability analysis.... The
default is alpha though you can choose split half etc. if you want. Sticking with
alpha, highlight all your items and transfer them in one go into the right hand box
for treatment.
Click the Statistics button to get some things we will need below for Item
Analysis. What you will need is obtained by choosing Descriptives for Item, and
Scale if Item Deleted. In SPSS it doesn't pay to ask for everything. You tend to
end up unable to see the wood for the trees. Proceed.
You may get a message 'Warning zero variance items'.
What does that mean? What is zero variance?
What value of alpha do you get?
Alpha, like a correlation coefficient, has a maximum of 1 if reliability is perfect.
Anything below .8 would generally be considered rather poor in reliability studies.
If your alpha is less than .8, why do you think that is?
Load the other files and do the same. But I suggest you save each one as an SPSS
file (with suffix .sav) as you go along so you can easily get it back to look at it
again. (Remember: File...Save Data...).
Which of the three listening tests is the most reliable?
Why do you think that is? Hint: remember alpha is a measure related to the Pearson
r correlation coefficient - what things did we find earlier helped increase Pearson
rs?
A problem which all the above approaches suffer from is that, to give an unbiased coefficient of
reliability, it has to be assumed that any person's response to any given item in a multi-item test
or inventory is uninfluenced by their response to any other - i.e. that there is 'independence' of
every item from every other one. This may not always be so, and is hard to check anyway. The
consequence of violation of independence tends to be an overestimate of reliability. However,
there is often little one can do about this so it is usually glossed over.
Which of the above tests do you think is more likely to lack this sort of
independence of response to different items?
How could you change the test to counter this?
In a psycholinguistic experiment where each condition is represented by a test-like set of items,
this independence is more certain if the sets of items are not given separately, but all mixed up
with each other in one list administered together.
8. Classic norm-referenced item analysis, no grading of items at issue
Main stats: mean score per item, item-total correlation, alpha if item deleted, chi-squared
goodness of fit test.
Traditional NR item analysis looks mainly at two aspects of items - their difficulty/facility and
their discrimination/discriminability.
8a. Facility. A norm-referenced measure is designed to show clearly which cases score higher
than which others, not the absolute level of knowledge of anything that each case has. So which
of the following items are 'good' in such a test, which 'bad'?
Items which most/all people get right?
Items which most/all people get wrong?
Items which about half the cases get right, half wrong?
If you are using a multi-item test or inventory as a research tool, e.g. to gather data on
memory for content of passages on different topics, with a hypothesis that where the
reader knows more about the topic in advance they remember more, would you want a
relative or absolute measure of how much of the content is retained?
You want to compare native speaker and non-native speaker teachers of English
for their focus on correctness. In your pilot of your elicitation instrument you give a few of
them numerous sentences with possible errors in them to mark. Some items are marked
wrong by all of them. Do you leave them out in the revised instrument?
How do the concepts of 'ceiling effect' and 'floor effect' relate to this matter?
Can you see how facility relates to NR reliability? An item which everyone gets wrong
has a standard deviation of...what? If all items in a test were got mostly right or mostly
wrong, the variation in total test scores would be...high or low? And do correlation
coefficients of the Pearson r type come out higher or lower if there are higher degrees of
variation among the scores being correlated?
8b. Discrimination. With the idea of internal reliability as described, a good item is obviously
one that pulls together with the others and so 'fits' the overall pattern of responses. In short, a
good item is one that the people who did best on the whole test got right and that the worst ones
got wrong - it discriminates cases in the same way as the test as a whole. An item got right by a
lot of people who were the worst overall is not discriminating helpfully. This can be measured by
the correlation coefficient of each case's score for the individual item with their score for the
whole test. (There are older by-hand methods for this involving comparing items' popularity with
the best third and the bottom third of testees/cases, not gone over here).
The output of the SPSS procedure you have already generated gives you all the
facility and discrimination information you need.
For facility look at the item means - i.e. the average score calculated for the item
rather than for a person taking the test.
If the items are scored 0 or 1, then 'good' means in reliability terms are near...
what value?
What are undesirable means - indicating items that might usefully be left
out?
If an item scored 0/1 has a mean of .7, what % of the testees got it right?
In the test with items scored out of three, what is the ideal mean for an item?
For discrimination, look at the item-total correlations. These are based on each
person's score for the item and their score for the whole test. If the item is
discriminating well, then its Pearson r correlation with the total test will be high.
Why do some of them come out as 0?
A third useful bit of information for each item is the alpha if item deleted column.
This reflects both facility and discrimination and tells you simply what the
reliability of the whole set of items would be if that item were left out.
To improve a set of items obviously you remove or replace items that
would ?increase or decrease? alpha if left out - which?
Generally items with means near the extremes and with poor item-total correlations
will increase alpha by being left out.
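The same figures can be reproduced from the raw grid if you want to check your reading of the SPSS output. A minimal Python sketch (same invented 0/1 grid as in sec. 7) printing, per item, the facility (item mean), a simple item-total correlation, and alpha if the item is deleted:

from statistics import correlation, mean, variance  # Python 3.10+ for correlation

grid = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]
k = len(grid[0])
totals = [sum(row) for row in grid]

def cronbach_alpha(rows):
    n = len(rows[0])
    item_vars = [variance([r[i] for r in rows]) for i in range(n)]
    return (n / (n - 1)) * (1 - sum(item_vars) / variance([sum(r) for r in rows]))

for i in range(k):
    item = [row[i] for row in grid]
    facility = mean(item)
    # An item everyone got right (or wrong) has zero variance, so no
    # correlation can be computed; print 0 for it, as in the SPSS output.
    disc = correlation(item, totals) if variance(item) > 0 else 0.0
    rest = [[v for j, v in enumerate(row) if j != i] for row in grid]
    print(f"item {i + 1}: facility {facility:.2f}, "
          f"item-total r {disc:.2f}, alpha if deleted {cronbach_alpha(rest):.3f}")

(Note this simple version correlates the item with the total including the item itself; the 'corrected' item-total correlation SPSS reports leaves the item out of the total.)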
Can you see which items would best be replaced? Why?
Do all this for all three listening tests and pick some items that are unsatisfactory in
each.
Can you see from the actual items why they are unsuitable?
8c. Distraction. Finally on item analysis, it is worth mentioning something further you can do
exclusively where the items are multiple choice. A basic rule of multiple choice tests (but far
from the only one) is that the 'distractors' should genuinely 'distract'. If there are four alternatives
but two are so ludicrously wrong that nobody would choose them, a four choice item in effect
becomes a two choice one - much easier to guess the answer from, with consequent unreliability.
To check on this, you simply have to have available information not just on who got each item
right or wrong, but also, if they got it wrong, on which distractor they chose.
Below are some hand counted figures for a few items related to prepositions in a
large scale piece of test piloting by Mukattash in Jordan some years ago.
Look at the descriptive statistics for the distractors (the frequencies) and answer
these questions for each item:
Which is the least popular distractor?
Does it differ in different types of school (i.e. for different populations of
testees)?
Can you tell why it is the least popular?
Of course distractors are never going to be chosen exactly equally, so you might
like a way of deciding where they are being chosen so differently that it is worth
revising them.
On your judgment, in which of these examples is the least popular distractor
so unpopular that you would change it in a revised test?
A statistical means for deciding the issue would involve the chi-squared goodness of fit test
(which is basically the one variable version of chi squared, used where frequencies in three or
more unordered categories need to be checked against 'expected' frequencies). This version of
chi squared would test the hypothesis that there is no difference between the observed
proportions of responses falling in the three distractor categories, and the proportions you would
expect if equal numbers fell in each (the 'expected' frequencies), beyond what you would expect
from the vagaries of sampling.
If that hypothesis is rejected, then ... do you change the distractors or not?
In item 9, in the vocational school, how many responses would you expect
each of the three distractors to attract if they were equally attractive?
To do the statistical test, enter the figures by the short method (cf LG475 tasks). You need a
column to label the three categories (distractors) and another with the three actual frequencies.
Then get SPSS to weight the data by the column with the frequencies. Then choose
Analyze...Non-parametric tests...Chi Square. Choose the category column as the test variable.
Keep everything else as given.
In the result, the Expected column tells you the frequencies under the null hypothesis: if all
distractors were equally distracting. The Residual is the Observed frequency minus the Expected
one.
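If you would rather script the test than click through the menus, the same goodness of fit test is a few lines in Python with scipy, here with invented frequencies (Mukattash's actual figures are on the attached sheets):

from scipy.stats import chisquare

# Invented choice counts for the three distractors of one item.
observed = [30, 25, 5]

# With no expected counts supplied, chisquare assumes they are all equal --
# exactly the null hypothesis described above.
stat, p = chisquare(observed)

expected = sum(observed) / len(observed)
for i, obs in enumerate(observed, start=1):
    print(f"distractor {i}: observed {obs}, expected {expected:.1f}, "
          f"residual {obs - expected:+.1f}")
print(f"chi-squared = {stat:.2f}, p = {p:.4f}")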
The least successful distractor will have a high ? negative or positive? residual - which?
If chi squared is significant, that is a reflection of high positive and negative residuals, i.e.
responses to distractors so uneven as to merit serious consideration of changing at least one of
them. I.e. it is unlikely the observed proportions could have arisen by chance from a population
where the division of choice was actually equal.
Is chi squared significant for distractors of Item 9, vocational school data?
Try this for some other items/schools.
For different schools would you come to different decisions about suitability
of the same item?
9. Item analysis where grading of items is involved: binary/dichotomous data
Main stats: Guttman's scalability coefficient, Rasch t.
Classic norm-referenced reliability and IA techniques identify and help one develop multi-item
instruments that distinguish between cases very consistently and finely where the cases are of a
similar ability. They don't tell you how much of anything anyone knows in any absolute sense -
you need a criterion-referenced measure for that. And they don't work well on a wide band of
ability - you need instruments developed to have the properties that Rasch analysis focuses on
for that. You can of course make a whole series of norm-referenced tests for different levels.
However, there is no easy way to relate the score of a person on one such test to their score on
another, because the tests would have to be compiled and refined separately for each population.
In testing circles there is a lot of interest nowadays in tests which contain items in a whole range
of difficulties (though not necessarily presented to test-takers in order of increasing difficulty), so
that, for example, one test will do for beginner and intermediate learners or for children of a wide
span of ages or for dialect speakers ranging from very broad to practically standard. A related
idea is that you can establish an 'item bank' from which you draw items to make up tests suited to
particular levels of case. But these items are all related to each other in such a way that you can
relate someone's score on one such test to someone else's score on another.
Compared with the classic approach above, items are chosen and checked not necessarily to be
of middling difficulty for everyone - there is a good range of difficulty - but discriminability of a
sort is retained. This is done by requiring items to form an 'implicational scale'. That means that a
set of test items of this sort will range from hard to easy, but in such a way that people who get
harder ones right also get right those items that are generally easier (i.e. which more people get
right). That is a third reliability-type concept, alongside the classical NR high correlation one
and the absolute CR high absolute agreement one.
Full story skipped on this occasion. See me if you need it!
10. One occasion analysed internally: data taken as interval or binary, and absolute/CR.
Classical reliability and IA is not criterion-referenced and the Rasch approach, though it claims
to be, is not straightforwardly so. There are various sophisticated ideas about how to improve
internal reliability of CR tests not covered here (e.g. see Subkoviak and Baker (1977) and
Bachman (1990: 212ff)). But the approach of 4c above can serve as a simple alternative.
Calculate an SD for each person across their responses, and take the mean of those SDs as an
indication of reliability. Low SD means more agreement. Identify bad items on the basis of
seeing which items, if omitted, lead to the best reduction in mean SD over all subjects (analogous
to alpha if item deleted). And identify subjects that are especially giving rise to disagreement
too.
My ABSREL2 programme does that for you.
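For anyone without access to ABSREL2, here is a rough Python sketch of the same idea - my reading of the description above, not the programme itself - with invented scores:

from statistics import mean, stdev

# Invented grid: one row per person, one column per item, all on one scale.
grid = [
    [2, 2, 3, 2],
    [1, 1, 1, 3],
    [3, 3, 2, 3],
    [0, 1, 0, 1],
]

def mean_person_sd(rows):
    # Mean, over persons, of the SD of each person's scores across the items.
    return mean(stdev(row) for row in rows)

print(f"mean within-person SD, all items: {mean_person_sd(grid):.3f}")

# Analogue of 'alpha if item deleted': which item's omission cuts mean SD most?
for i in range(len(grid[0])):
    rest = [[v for j, v in enumerate(row) if j != i] for row in grid]
    print(f"without item {i + 1}: mean SD {mean_person_sd(rest):.3f}")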
CONCLUSION ON RELIABILITY AND RELATED IA
11. A further concern may be as follows. Ideally one does all the reliability checking in a pilot
study, and uses the information to revise everything for the main study. But suppose one has
done the relevant things from the above in one's main study. There is no possibility of revising
the instrument or procedure and going back and doing it again, so one has to just exploit the
figures one has got to the best possible extent in the analysis of the real RESULTS of one's
enquiry, with due acknowledgement of its possible failings. But what figures does one use?
Typical procedures used to 'massage' data:
In general.
Leave out cases that seem odd/don't fit. But make a good argument for them not belonging to
your targeted population on grounds other than just that their responses weren't right! E.g. you
could show that, prompted by their odd responses, you looked back at your notes from the data
gathering and can see that that person was rather unwilling to participate, or was a learner who
had learnt English under different conditions from the other subjects, etc..... Of course this option
is not available in pedagogical uses of tests where one is interested in the scores of all particular
students.
In design (D).
Leave out whole items that don't fit the pattern of the set they belong to. That is less
controversial. You may also be able to identify particular responses of particular cases that are
odd and omit just them: e.g. in response time tests it is common to leave out responses that are
overlong or overshort.
What argument could you use to support that?
How could you define 'too long'?
For onward analysis of results you then have to calculate a score for each case/person that is their
average over the scores that you have decided to retain, which may be a different number for
different cases/people. This also arises in general if there are missing values/scores for other
reasons. If everything is in one row per case format in SPSS you can do this by using
Transform... Compute... and selecting MEAN(numexpr,numexpr,...) from the list of
functions. This calculates an average over the non-missing values in a set of columns. So if your
data was three cases on three items, with some missing responses thus:
Item1 Item2 Item3
Case1 1 0 0
Case2 1 . 0
Case3 . . 1
Fill in the formula in SPSS as MEAN(Item1,Item2,Item3) and the new column will come out
with averages using only the non-missing columns. Here that is .33, .5 and 1.
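The same calculation is trivial to reproduce outside SPSS if you ever need to check it. A minimal Python sketch of the grid above, with None standing for SPSS's missing value '.':

from statistics import mean

data = {
    "Case1": [1, 0, 0],
    "Case2": [1, None, 0],
    "Case3": [None, None, 1],
}

for case, scores in data.items():
    present = [s for s in scores if s is not None]  # drop missing values
    print(f"{case}: mean over non-missing items = {mean(present):.2f}")

# Prints .33, .50 and 1.00, matching the SPSS MEAN() result above.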
In design (A) and (B) and (C).
The usual wisdom is that the mean of several separate measurements is always more reliable
than one measurement of anything. But if two out of three judgements/occasions agree well with
each other, and differ from the third, maybe leave out the scores from the 'odd' judgement that
didn't agree well, and use the average of the other two. If different judges/scorers are involved,
then you can leave out an unreliable scorer and use the scores averaged over the others. Or train
up that scorer a bit more and get him/her to score the protocols again!
When combining figures from several judges/occasions etc. there is a special problem if the data
is not interval scores or rank orders (which can be averaged with the mean). With categorisations
you would typically use the category that the majority of judgements placed a case in (i.e. the
'modal' category): so it is useful to use uneven numbers of judges to make this easy! E.g. if two
out of three times when you recategorised your think aloud data you thought a certain strategy
was 'Prediction', that is the one you regard it as being for your actual analysis of results.
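Finding the modal category is again easy to script if the judgements are in machine-readable form. A minimal Python sketch, with invented labels from three judges:

from collections import Counter

# Invented: three judges' categorisations of the same strategy instances.
codings = {
    "episode 1": ["Prediction", "Prediction", "Revising"],
    "episode 2": ["Rehearsing", "Rehearsing", "Rehearsing"],
}

for episode, labels in codings.items():
    category, votes = Counter(labels).most_common(1)[0]
    print(f"{episode}: modal category '{category}' "
          f"({votes} of {len(labels)} judgements)")

# Note even with an odd number of judges a 1-1-1 split has no majority;
# most_common then just returns one of the tied categories.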
There is quite a bit about the potential causes of unreliability in my book. Here we have only
looked at how to check on it statistically.
PJS revised a little Jan 2012