
Authentic language tests: where from and where to?

Elana Shohamy Tel Aviv University and
Thea Reves Bar-Ilan University

In the first part of the paper the development of language tests towards
authenticity is surveyed. The advantages and shortcomings of indirect vs
direct (authentic) types are analysed. Whereas indirect tests were more
efficient and lent themselves to psychometric analyses, they did not tap
real-life language. On the other hand, direct tests based on real interactions
aim at capturing real language and its variations in use. In the second part
of the paper two problems involved in authentic tests are addressed:
a) The difficulty of applying appropriate psychometric measures, caused
by the complexity of tapping the whole construct of authentic language,
and b) the large number of test variables which interfere with the authen-
ticity of the language produced and reduce it to authentic test language.

I Towards authenticity: survey of the trend


The topic of authentic tests has become so popular in recent years
that it is hard to believe that a little more than a decade ago it was
not an issue at all. At that time psychometric properties were the
major criteria by which language tests were judged, and issues such as
objective items, reliability and concurrent validity dominated the
language testing field. That period may be regarded as belonging to
the psychometric-structuralist era (Spolsky, 1978), but it was also
a time when the first signs of a need to assess oral proficiency of
students and prospective teachers began to appear. The first tests
of oral proficiency were administered in the language laboratory,
i.e. a small booth equipped with a tape recorder with which individual
test takers were expected to converse. The oral tasks usually included
the mechanical repetition of words and sentences and the supplying
of pattern answers to pattern questions. It was important to talk at
a pace fast enough to beat the beep sound of the tape, and slow
enough so as not to have too many silent moments before the beep
went off.

It has to be recognized that the testing of speaking proficiency on
standardized language tests was an innovation in itself at that time;
before that, language tests had generally not included speaking tasks
at all. But testing speaking on standardized tests proved that oral
proficiency could be assessed in spite of the difficulties involved
in the testing of students on an individual basis. In these early tests,
how speaking was tested was less important; what did matter was
that the test should be psychometrically solid.
However, even at that time, scholars contended that the psycho-
metric properties are necessary but not sufficient criteria of good
language tests. Language in these early tests was assessed in artificial
circumstances; test takers talked to machines and not to other
human beings, which is very different from the way human beings
normally use language. In those new speaking tests this meant that
the intonation and the length of sentences were very different from
those of real spoken discourse.
This was about the time (1972) when John L.D. Clark wrote his
article 'Theoretical and technical considerations in oral testing'
(later to be published in Jones and Spolsky (1975)). In that article
Clark differentiated between direct and indirect language tests; his
definition of direct tests is identical to what we term today authen-
tic tests. In direct testing, according to Clark, the testing format and
procedure attempts to duplicate, as closely as possible, the setting
and operation of the real-life situation in which the proficiency is
normally demonstrated (p. 10); indirect tests, on the other hand,
do not require the establishment of a highly face-valid and repre-
sentative testing situation (p. 11). He proceeded to give examples
of direct tests in each of the language skills. Direct tests of speaking
proficiency for example, involve a test setting in which the examinee
and one or more human interlocutors do, in fact, engage in a com-
municative dialogue. A direct proficiency test of reading comprehen-
sion would involve the use of authentic magazine articles, newspaper
reports and other texts actually encountered in real-life reading
situations. The main claim that Clark made in his article was that the
validity of the indirect procedures as measures of direct real-life
proficiency is established through statistical correlations:
If and when a given indirect test is found to correlate highly and con-
sistently with more direct tests of the proficiency in question, it becomes
useful as a surrogate measure of that proficiency, in the sense that it
permits reasonably accurate predictions of the level of performance that
the student would demonstrate if he were to undergo the more direct
tests (p. 11).
This claim can be viewed today as a landmark in the controversy
between direct and indirect tests since it became the rationale for
research studies which examined the relationship between direct and
indirect tests. The aim of these studies was to determine if the more
efficient indirect tests could be made valid surrogates for the less
efficient, yet more valid, direct tests.
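The correlational rationale described above can be illustrated with a small sketch. It computes a Pearson correlation between scores on an indirect test and scores on a direct test for the same test takers. The score lists and the names `cloze_scores` and `interview_ratings` are hypothetical and serve only as an illustration; they are not data from any of the studies cited. A high coefficient here would show only that the two sets of scores rank test takers similarly, not that both tests tap the same trait.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: six test takers, one indirect and one direct measure.
cloze_scores = [12, 18, 25, 31, 37, 40]             # indirect test (assumed)
interview_ratings = [1.0, 2.0, 2.0, 3.0, 4.0, 4.0]  # direct test (assumed)

r = pearson_r(cloze_scores, interview_ratings)
print(f"correlation between indirect and direct scores: {r:.2f}")
```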
This question was of special interest at the time since information
about the FSI (Foreign Service Institute) Oral Interview test, until
then considered confidential, was gradually being released by US
Government agencies (Wilds, 1975; Jones, 1975; 1978). The FSI
Oral Interview was an example of a direct test of oral proficiency
where the test taker and tester actually engaged in a face-to-face
conversation. Although there was little information about the
psychometric properties of the FSI oral test, it was expected that
high correlations between the Oral Interview, which was a direct
test, and other tests, which were indirect, would show that authentic
oral proficiency could also be accurately assessed by more efficient
indirect procedures.
Results from studies which examined the relationships between
direct and indirect tests (Stevenson, 1974; Clifford, 1977; 1978;
Brutsch, 1979; Hinofotis, 1976; Oller and Perkins, 1980; Oller,
1983) did often point to high correlations between the two types of
tests. Still, it was contended that correlations alone could not be con-
sidered sufficient proof for the claim that the two types of tests were
really tapping the same trait. Strong criticism was expressed about
the reliance on correlation analysis for examining the direct/indirect
questions (Stevenson, 1974; Bachman and Palmer, 1981).
High correlations are obviously needed, but there are other require-
ments as well.
Validating a test against performance on another test, does not always
demonstrate that the test measured the desired construct, for it may be that
one of the tests measures factors quite apart from those of the other test,
which happen to manifest themselves in both tests (Carroll, 1973).

Suggestions were made as to the need for examining the question with
more appropriate statistical procedures which would estimate the
effect of the testing method on the trait which was being measured.
In fact, research studies which used the multimethod multitrait
validation procedure (Stevenson, 1974; Clifford, 1977; 1978; Bachman
and Palmer, 1981; Brutsch, 1979) did find that the methods of
testing affected the assessment of the trait and therefore it could not
always be claimed that direct and indirect tests are the same. Other
considerations against the use of indirect tests were brought up as
well. Shohamy (1982) made claims that the degree of correlation
may be a result of the instruction that had taken place; there may be
unique aspects of language skills which are not being tapped by
indirect measures. In addition, no diagnostic information can be
obtained from indirect tests; the interpretation of scores obtained is
problematic and, above all, there is a danger that test takers may
start practising the technique of indirect tests rather than the actual
Thus the advantages of indirect tests seemed to be their high
degree of efficiency and their known psychometric properties; the
shortcomings were too high a price to pay for these two qualities.
The general feeling was that it was more worthwhile to search for direct
tests which would possess good psychometric properties and would
also be efficient and practical. Such tests would require the test
takers to perform the language trait directly, in an authentic testing
environment; they would also bear a closer relationship between
language performance on the test and language performance in real life.
Subsequent years can best be characterized as the trend towards
the development of authentic language tests, which strove to
resemble, as closely as possible, real-life language performances. Tests
such as the TEEP (Test for English for Educational Purposes) and the
Royal Society of Arts tests are examples of tests which attempted to
be authentic. Even more traditional standardized tests, such as the
TOEFL, have been revised so as to include authentic texts and tasks.
There is no doubt that something approaching authentic language
tests is possible: performing the kind of language which resembles
real language, the language which is needed and used in life, is pre-
ferable to performing the language for the testing of which there is
only a correlational rationale. In authentic tests tester and test taker
can have trust in the language output because it is real language. The
introduction of the concept of test authenticity brought a major
shift in language testing. However, in the wave of enthusiasm towards
the development of authentic language tests, two major problems
have been overlooked. The first is the lack of attention given to the
psychometric properties of authentic tests; the second is the naive
belief that the so-called authentic tests are really authentic. These
two issues will now be addressed and elaborated on.

II The psychometric issue

Authentic-type tests are supposed to reflect real language use. However,
the range of variability of real language does not easily lend
itself to the traditional classical approaches of test analysis. Ironically,
it is precisely this variability which makes the need for psychometric
safeguards so urgent and necessary.

Authentic language tests are expected to replicate authentic
language performance. However, real language performance involves
linguistic as well as extralinguistic, social and psychological variables,
all of which operate in constant interaction (Hymes, 1971). This
interaction tends to vary from one context to another. Such varia-
tions in language production have always been noticed by socio-
linguists who claimed that language output is a function of who is
saying what, to whom, when and why; it also depends on the per-
sonality, sex and state of mind of the partners, their role-relationship
in the communicative act as well as the time and locale in which the
communicative act takes place.
In authentic testing situations such variations are also part of the
test. Thus the role relationship of the tester and test taker, the
personality and sex of the tester, the function and purpose of the
speech, etc. are all integral parts of the test situation and are likely
to cause variations in the language output produced in the test. These
variations introduce a major source of difficulty in language testing:
they have to be taken into account in order to make sure that the
test results can be replicated with a high degree of stability and
accuracy (i.e. the reliability of the test).
However, the question of the extent to which these communica-
tive variations manifest themselves in testing situations and the
extent to which the assessment of the language is affected by these
variations have hardly been addressed so far.
In indirect unauthentic language tests, the variations that exist in
natural language use did not come across. Indirect tests screened
these variations out, excluding most of the real-life variables. For
example, by conducting a conversation with a tape recorder, the
variable of the tester was held constant: it was always the same
inanimate black box. Therefore, the assessment was more stable and
did not express the variability which exists in real conversation. It
was therefore much easier to obtain stable and reliable results on
such tests.

In the authentic testing trend, on the other hand, testers want to
get the whole truth, to elicit authentic language in which all the
variables - linguistic as well as non-linguistic ones - are included.
They therefore encourage the incorporation of everything that is
involved in real-language communication. The result of this effort,
which has to be faced, is that the language output becomes more
complex, more varied and therefore more difficult to control (see
Seliger's article in this volume).
In a study on the stability of oral expression on one test, Shohamy
(1983) found that the test taker had a relatively low probability of
obtaining the same score on the same oral interview test when he was
tested by two different testers on two different occasions. Moreover,
when the speech style changed from interviewing to reporting the
probability of obtaining the same score in oral production was even
lower. In another study (Shohamy et al., 1983), it was again found
that when four different speech styles were tested, as exemplified
in an interview, a reporting task, various role-play situations and a
group discussion, most test takers did not obtain the same scores in
all the four oral interactions. These findings indicate that the varia-
tions which exist in natural language performance do, in effect, come
across in authentic tests. These variations, however, point to problems
of reliability and variability of authentic tests. It seems that the
stable and reliable scores obtained from indirect tests were partially
due to their low validity: they were not tapping the broader and
fuller construct of real-language use.
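One simple way to express the kind of score stability discussed above is the proportion of test takers who receive exactly the same rating on two administrations. The sketch below is only an illustration of that idea, not the procedure used in the studies cited; the function name `exact_agreement` and both rating lists are hypothetical.

```python
def exact_agreement(first, second):
    """Proportion of test takers given the same rating on both occasions."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(1 for a, b in zip(first, second) if a == b)
    return matches / len(first)


# Hypothetical ratings for eight test takers, interviewed by two
# different testers on two different occasions (assumed data).
ratings_tester_a = [3, 2, 4, 3, 1, 2, 4, 3]
ratings_tester_b = [3, 3, 4, 2, 1, 3, 3, 3]

rate = exact_agreement(ratings_tester_a, ratings_tester_b)
print(f"proportion of identical scores: {rate:.2f}")
```

Exact agreement is a strict criterion; an agreement index that tolerates adjacent ratings, or a correlation between the two sets of ratings, would give a more lenient picture of the same data.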
Since in language testing we cannot afford to have either low
validity or unstable and unreliable results, we are confronted here
with a serious fundamental problem which has to be solved. One
approach would be to construct special tests for every language
interaction, which would, in combination, reflect all the possible
language variations. This, however, would obviously be an almost
inconceivable task. A more realistic approach would operate under
the assumption that in language output there are some stable elements
which run across all authentic language performances, as well as
others which are variable and specific to each of the language per-
formances. This approach requires the estimation of both the stable
component of language output and the specific fluctuating elements
of a number of commonly used language interactions. The estimation
could be done by administering different types of language tests
which will invariably include the elements believed to be stable,
while manipulating those elements which are believed to be specific,
such as the tester's status, the speech style, the environment, the
mood, etc. By comparing the results, in terms of scores obtained on
such tests, it may be possible to estimate the stable elements versus
the specific ones of various language interactions. This procedure will
hopefully help to solve the problem of reliability. In the meantime,
however, the accuracy of the scores obtained from the authentic
tests remains questionable.
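The comparison of scores proposed above could be approached in several ways. The sketch below uses one possible technique, a two-way decomposition of variance in a complete persons-by-tasks score matrix, which is an assumption of this illustration and not a procedure specified in the paper. The variance attributable to persons is treated as the stable component, and the residual (person-by-task interaction, confounded with error) as the task-specific component. The score matrix and the function name `variance_components` are hypothetical.

```python
def variance_components(scores):
    """Estimate stable (person) and specific (residual) variance.

    scores[p][t] is the score of person p on task t; every person must
    have a score on every task, with at least two persons and two tasks.
    """
    n_p, n_t = len(scores), len(scores[0])
    if n_p < 2 or n_t < 2 or any(len(row) != n_t for row in scores):
        raise ValueError("need a complete matrix, at least 2 x 2")
    grand = sum(sum(row) for row in scores) / (n_p * n_t)
    person_means = [sum(row) / n_t for row in scores]
    task_means = [sum(row[t] for row in scores) / n_p for t in range(n_t)]
    # Sums of squares for persons, tasks, and what is left over.
    ss_person = n_t * sum((m - grand) ** 2 for m in person_means)
    ss_task = n_p * sum((m - grand) ** 2 for m in task_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_resid = ss_total - ss_person - ss_task
    ms_person = ss_person / (n_p - 1)
    ms_resid = ss_resid / ((n_p - 1) * (n_t - 1))
    # Negative estimates can arise by chance; they are truncated at zero.
    var_stable = max((ms_person - ms_resid) / n_t, 0.0)
    var_specific = max(ms_resid, 0.0)
    return var_stable, var_specific


# Hypothetical scores: five test takers on four oral tasks
# (interview, reporting, role play, group discussion).
scores = [
    [4, 3, 4, 3],
    [2, 2, 3, 1],
    [5, 4, 4, 5],
    [3, 1, 2, 3],
    [1, 2, 1, 2],
]

stable, specific = variance_components(scores)
share = stable / (stable + specific)
print(f"share of variance that is stable across tasks: {share:.2f}")
```

With only one score per person per task, the specific component cannot be separated from measurement error; repeated administrations of each task would be needed for that.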
There are also other psychometric problems which are unique to
authentic tests, such as the issue of dependency of items: real-life
language performance is integrative and therefore does not consist of
independent items. Raatz, in his paper in this volume, deals with
some aspects of this issue and suggests various statistical approaches
for handling the problem. Since, in any case, the accuracy of authentic
test scores is still an issue to be solved, users of such tests should
be very cautious in interpreting their results.

III The authenticity issue

The second difficulty arising from these types of tests relates to the
naive belief that language produced in the course of an authentic
communicative test is a true and exact representation of real-life
language. Stevenson and Spolsky in their papers in this volume have
already pointed out that the language obtained on language tests is
no more than authentic test language, which is not the same as real-life
language. Taking this position as a point of departure, we will
attempt therefore to analyse the differences between authentic test
language and authentic real-life language. We identify five main
factors which create differences between the two, referring by way
of illustration to a group of authentic tests administered to high
school students. Specifically, these tests consisted of an interview,
a role play (including a number of speech acts and functions), a
reporting task and a group discussion. The first three tests were
administered on a one-to-one basis, while the last involved four
test takers interacting in a group. The five factors considered respon-
sible for reducing the authentic language into authentic test
language are: the goal of the interaction, the participants, the test
setting, the topic and the time of the tests.
1 The goal of the interaction
We refer here to the fact that in real-life situations people may inter-
act for various purposes, none of which is to obtain a score for their
language performance. In a test, on the other hand, both the tester
and the test taker know that the only purpose of the interaction is
to obtain an assessment of the test taker's language performance. The
tester evaluates the test taker's performance with a score which may
have serious bearing upon the test taker's future. They both know
that they would not have been in the specific situation created by
the test, had there not been a need for an assessment of the test
taker's language performance. While in real-life situations participants
ignore the quality of the language in favour of transmitting the mes-
sage, in a test the quality of the language produced is the central
issue. Thus, both parties are clearly aware of the artificiality of the
test situation they are involved in, which heavily imposes its con-
straints as well as its consequences.
Even in tasks which are communicative and resemble real-life
interactions, the test taker is constantly aware of the fact that the
goal of the interaction is the evaluation of the language produced.
The issue involved in the goal of the interaction is exemplified in
each of the oral tests as follows: in the oral interview the tester is
seldom interested in the test taker's opinions; he elicits them only
in order to obtain sufficient language for assigning a score. In the role
play various roles are assumed by both test taker and tester, only in
order to facilitate the production of a wide range of speech acts
which will provide enough language to enable the tester to assign a
score. In the group discussion partners debate a controversial issue,
not really to convince one another or to reach a consensus. In the
reporting task the goal of the test taker's performance is not to
convey the content of an article to the tester, but rather to perform
a language task. In all these examples the genuine goal of the inter-
action is to prompt the test taker to produce sufficient language so
that he may be assigned a score. Another aspect of this same factor
relates to the rating scale on oral tests. The constant awareness of
the use of the rating scale by the tester may substantially influence
the test taker's language performance.
Thus, unlike real language use, the real goal of the interaction in a
test is the test itself. This undeniable fact is likely to impinge upon
the genuine authenticity of the language produced.

2 The participants
By referring to the participants as one of the factors responsible for
reducing the authenticity of the language used in a test, we recognize
the fact that the tester and test taker would not necessarily be
involved in a similar communicative act with one another in real life.
In the one-to-one tests (interview, role-play, reporting), it is
probably the first time the tester and the test taker have met. They
may be coming from very different and mutually unknown back-
grounds, and they are probably not used to talking to one another.
These factors make the interaction artificial, awkward and difficult.
The unfamiliarity is especially noticeable in the oral interview, in
which the test taker is asked personal and often private questions by
someone he has never met before. For test takers who are not used
to talking openly to strangers, this may be a very embarrassing and
restricting situation: we rarely tell strangers our opinions, thoughts
or problems in first-time encounters. This may therefore have a con-
siderable effect upon the language produced.
In the group discussion the participants know one another, and are
used to having conversations among themselves. In this test, however,
the problem is caused by the fact that the participants are called
upon to talk to one another in the foreign language, a language which
they never use for real communication among themselves. This
obviously reduces the authenticity of that test.
3 The setting
By setting we refer to the physical environment in which the test
takes place. In a test situation the setting is either an office or a
classroom where the tester and the test taker interact, usually with a
table separating the two, whereas in real life these types of conversa-
tions take place in more informal settings, such as the street, a store,
a living room, etc.
In the role-play test, for example, the test taker may play the role
of someone who accidentally meets a friend and inquires about a
room to rent. In the test the conversation is not carried out in the
particular setting predicted by the topic of the interaction, the one in
which the interaction would have happened in real life. In addition,
the tape recorder often used during the oral test (for interscorer
reliability checking purposes) may introduce a further violation of
authenticity which can affect both the tester's and the test taker's
language performance.

4 The topic
Topic refers to the content of the conversation in the test. In real-
life communicative situations the topic is determined by both
participants, usually in an unplanned manner; it evolves from the cir-
cumstances, the environment and the common background of the
participants. However, in a test situation the topic is determined and
imposed by the tester. In the oral interview the tester often plans
ahead of time what he will be asking in order to conduct the inter-
view. In the role-play test the tester determines for the test taker
what role he will play; in the reporting task the tester decides what
the test taker will be reporting on, and in the group test what the
group will be discussing. These situations hardly ever happen in real-life
communication, where the topic of the interaction is only rarely
determined by external considerations ahead of time. Imposing the
topic artificially is very likely to have an impairing effect on the
authenticity of the language.

5 The time
Time refers to the time limits imposed on the tests. Whereas in most
tests there is a time limit, i.e. the test taker must begin and complete
his task in a given period of time, in real-life interactions time
does not play such an important role. It is very likely that the time
limit has some effect on the quality of language produced. Also,
different communication strategies are probably set in motion under
the pressure of limited time.
These five factors, the goal of the interaction, the participants, the
setting, the topic and the time of the test, can be considered threats
to the authenticity of the language produced on these tests. Each of
them may violate the authenticity of the language produced by the
test taker, so that the score assigned as a result of the test taker's
performance on the test is most probably not the true reflection of
the real oral proficiency underlying it. In other words, the test
taker's performance on the authentic communicative language test is
not likely to be a true manifestation of his oral language competence.
Nevertheless, the language output in communicative tests which is
meant to approximate real-life authenticity is in all probability more
authentic than the one produced in tests where talking to a tape
recorder was the standard situational context. There is, however,
still a gap in terms of authenticity between the language produced
on any test whatsoever and the language used in real-life situations.

IV Conclusion
In this paper we first reviewed the trend of development towards
authenticity in language testing. We have shown how language tests
moved from the indirect era towards becoming more direct and
more authentic, attempting to elicit the language used by real people
in real life. It was pointed out, however, that although these tests
are more face valid, i.e. they seem to resemble more the actual
language used in real life, they still have some major deficiencies. One
such deficiency is the lack of measurement and statistical analysis
and the limited empirical evidence to show their psychometric
qualities. There are also difficulties in trying to impose classical
measurement theories on these unique types of tests which aim at
tapping the whole construct of language performance. It was also
pointed out that the language of authentic tests is not a true repre-
sentation of real-life language. It was shown that it is difficult, if not
impossible, to even approximate real-life language use on language
tests. The most we can obtain, at the moment, is authentic test language.
Two alternative suggestions can be made. If we insist on eliciting
authentic real-life language and not only authentic test language,
we should adopt an ethnographic approach, i.e. observe language
use in a number of natural situations, without the test taker being
aware of the assessment of the language performed. If, on the other
hand, this is found to be too complex and cumbersome, and therefore
unrealistic, we have to admit that obtaining authentic real-life
language is beyond reach, and compromise on authentic test language.
In this latter case, however, we have to make sure that the test is
fully test-proof, i.e. that it meets all the necessary psychometric

V References
Bachman, L.F. and Palmer, A.S. 1981: The construct validation of the FSI Oral
Interview. Language Learning 31, 67-86.
Brutsch, S. 1979: Convergent/discriminant validation of prospective teachers'
proficiency in oral and written production of French. University of
Minnesota, doctoral dissertation.
Carroll, J.L. 1973: Foreign language testing; will the persistent problems persist?
In O'Brien, M.C., editor, ATESOL testing in second language teaching:
new dimensions, Dublin: The Dublin University Press.

Clark, J.L.D. 1975: Theoretical and technical considerations in oral proficiency
testing. In Jones, R.L. and Spolsky, B., editors, 1975.
Clifford, R.T. 1977: Reliability and validity of oral proficiency ratings and
convergent/discriminant validity of language aspects of spoken German
using the MLA cooperative foreign language proficiency tests, German
(speaking) and an oral interview procedure. University of Minnesota,
unpublished PhD dissertation.
1978: Reliability and validity of language aspects contributing to oral pro-
ficiency of prospective teachers of German. In Clark, J.L.D., editor, Direct
testing of speaking proficiency: theory and application, Princeton: Edu-
cational Testing Service.
Hinofotis, F.B. 1976: An investigation of the concurrent validity of cloze test-
ing as a measure of overall proficiency in English as a second language.
Southern Illinois University, PhD dissertation.
Hymes, D. 1971: On communicative competence. Philadelphia: University of
Pennsylvania Press.
Jones, R.L. 1975: Testing language proficiency in the United States Govern-
ment. In Jones, R.L. and Spolsky, B., editors, 1975.
1978: Interview techniques and scoring criteria at higher proficiency levels. In
Clark, J.L.D., editor, Direct testing of speaking proficiency; theory and
application, Princeton, New Jersey: Educational Testing Service.
Jones, R.L. and Spolsky, B. editors, 1975: Testing language proficiency. Arling-
ton, Virginia: Center for Applied Linguistics.
Oller, J.W. Jr 1983: Issues in language testing research. Rowley, Massachusetts:
Newbury House.

Oller, J.W. Jr and Perkins, K. 1980: Research in language testing. Rowley,
Massachusetts: Newbury House.
Shohamy, E. 1982: Predicting speaking proficiency from cloze tests: theoretical
and practical considerations for tests substitution. Applied Linguistics 3,
1983: The stability of the oral proficiency trait on the oral interview speaking
test. Language Learning 33, 527-40.
Shohamy, E., Reves, T. and Bejarano, Y. 1983: An integrated test of oral
proficiency: from research results to educational policy. Paper presented at
the Fourth ACROLT Meeting, Qiryat Anavim, Israel.
Spolsky, B. 1978: Introduction: linguists and language testers. In Spolsky, B.,
editor, Advances in language testing Series 2, Arlington, Virginia: Center
for Applied Linguistics.
Stevenson, D.K. 1974: A preliminary investigation of construct validity and the
test of English as a foreign language. University of New Mexico, PhD
Wilds, C. 1975: The oral interview test. In Jones, R.L. and Spolsky, B., editors,