
Advancing Applied Linguistics: Assessment
Carol A. Chapelle
English Language Learning and Technology (2003)
John Benjamins Publishing Company, Amsterdam
Notes by Sajit M Mathews
With the introduction of technology into testing, the speed of development
seems to be increasing, sometimes leading to tunnel vision (Brown and
Duguid, 2000)
Computer assisted language assessment should help researchers to identify
and explore theoretical issues in assessment.
Technology can give unique perspective on two central theoretical issues:
construct definition and validation.
But efficiency should not be the sole goal of research if this is to be
accomplished.
CALT: Computer-Assisted Language Testing
CALT seems to be moving fast to the goal of faster and more efficient life as
imagined by technologists.
By 2000, CALT was trying to develop computer adaptive tests (aim: more
efficiency than pen and paper tests)
Brown: CALT uses a delivery algorithm to select items based on prior
experience and performance of the learner. It terminates test based on
consistency in answering. It reduces testing time and maximises efficiency by
scoring based on consistently correct answers.
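As a rough illustration of the kind of delivery algorithm described above (the item pool, the ability update, and the stopping rule here are invented for the sketch, not taken from Brown or any operational computer-adaptive testing engine):

```python
# Illustrative sketch only: a toy adaptive item-selection loop.
# Item difficulties, the ability update, and the stopping rule are invented.

def run_adaptive_test(item_pool, ask, start_ability=0.0, window=5):
    """item_pool: list of (item_id, difficulty); ask: callable returning True/False."""
    ability = start_ability
    administered = []
    responses = []

    while item_pool:
        # Select the unused item whose difficulty is closest to the current estimate.
        item_id, difficulty = min(item_pool, key=lambda it: abs(it[1] - ability))
        item_pool = [it for it in item_pool if it[0] != item_id]

        correct = ask(item_id)
        administered.append(item_id)
        responses.append(correct)

        # Crude ability update: move toward harder items when correct, easier when not.
        ability += 0.5 if correct else -0.5

        # Terminate once the last few answers are consistent (all right or all wrong).
        recent = responses[-window:]
        if len(recent) == window and len(set(recent)) == 1:
            break

    # Score based on the consistently correct answers, as the notes describe.
    return sum(responses), administered

# Hypothetical usage: a seven-item pool and a learner who answers the easier items correctly.
pool = list(enumerate([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]))
score, items_given = run_adaptive_test(pool, ask=lambda item_id: item_id < 4)
```

An operational computer-adaptive test would replace the crude ability update with an IRT-based estimate and stop on a measurement-error criterion; the point of the sketch is only the select-respond-update-terminate cycle.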
Alderson (1991) suggested a number of ways in which technologies can
expand the capabilities of language tests, for example by measuring time and
navigation patterns during test taking, storing and presenting information in a
variety of ways, and analysing learners' language.
In the 20 years leading up to 2003, efficiency was the major theme discussed.
Bachman (2000) contains a section on language testing: "the new task formats
and modes of presentation that multi-media, computer-based test
administration makes possible raise all of the familiar validity issues, and may
require us to redefine the very constructs we believe we are assessing."
Language assessment theory: principles for making inferences based on observed performance
Related questions: how inferences are justified, how test scores are
computed, and how they are used, as well as questions about the best ways
of defining language constructs
theoretical issues in language assessment grow out of practical needs for
developing tests and justifying their use, but are not limited only to those
questions of immediate relevance for particular tests
language assessment theory is theory for practice
Thus, an appropriate starting point is noting practical questions that come up
in computer-assisted tests: how the construct that a test measures is defined,
and the process of validation
Construct definition
Decisions about the degree of learner choice during test-taking, and about
scoring constructed responses, depend on the test developer's precise definition
of what the test is intended to measure. This is fundamental for test design and
validation.
Test task design: based on what inference is to be made from learners'
performance on the tasks
So the language abilities that underlie test performance are discussed
Validation research: a form of hypothesis-testing which examines hypotheses
concerning the relationship between test scores and inferences associated
with those scores
centrality of construct definition in language testing
Major questions researchers ask: what language constructs consist of and
perspectives for defining them
A technology-mediated English test is a particularly well-fitting straightjacket
for theory because of the range and detail of elements in computer-assisted
test design
Example: whether a reading test should have options to go back and read the
passage while answering questions. The act/competence of going back and
forth will affect the overall test performance.
In a pen-and-paper test, the factor that can be controlled is time. But in a
computer-assisted test, there are other factors that can be affected by the
design decisions the designer makes.
So construct definition is very important.
Decisions about design of procedures for computer-assisted scoring of
constructed response items are also problematic.
Scoring needs to take into consideration only certain parameters measured by
the computer, not all of them. This selection is left to the designer: how is it
to be decided?
It is computers that have given us detailed data on the process of test-taking.
This enables designers to select the construct elements and to evaluate
performance based (only) on them.
Ex: speaking test: manual evaluation usually gives the overall performance a
rating, but a computer system can pick out the specific elements mentioned in
the construct definition and score only what was being tested.
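A minimal sketch of this idea, assuming the spoken response has already been transcribed; the construct elements, extractors, and weights below are hypothetical, invented only to show scoring restricted to what the construct definition names:

```python
# Sketch: score only the features named in a construct definition and ignore
# everything else in the performance. Feature names and weights are hypothetical.

def score_transcript(transcript, construct):
    """construct: {feature_name: (extractor, weight)}; returns a per-feature profile."""
    words = transcript.lower().split()
    profile = {}
    for feature, (extractor, weight) in construct.items():
        profile[feature] = weight * extractor(words)
    return profile

# Hypothetical construct definition: only vocabulary range and response length count;
# accent, hesitations, and content are deliberately left out of the score.
construct = {
    "vocabulary_range": (lambda ws: len(set(ws)) / max(len(ws), 1), 0.6),
    "response_length": (lambda ws: min(len(ws) / 100.0, 1.0), 0.4),
}

profile = score_transcript("well I think the picture shows two people talking", construct)
total = sum(profile.values())
```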
Validation
Are inferences and use of computer-delivered language tests valid?
The significance of test method in influencing examinees' performance is well
documented in research on second language testing.
In this research, it is assumed that any influence of test method on the test
score is undesirable.
If there is a difference between scores on a pen-and-paper test and a computer-
delivered test, the difference is taken as evidence that the computer-delivered
test is not valid.
There is, however, the possibility that the language ability needed for engaging
in technology-mediated communication should be expected to differ from what
is required for performance in other registers.

A second validation issue ensues from the use of technology-mediated methods
in SLA research.
This spotlights questions about inferences made from performance.
Measurement is not the focal concern in SLA.
Assessment concepts evolve out of educational measurement. There is a
growing disparity between how they are used in SLA and in the measurement
literature.
Should the standards for educational and psychological measurement pertain
to the validation of SLA measures? This question will be asked repeatedly as
new measures are created through the use of technology.
A third validation issue
It is based on the concern about negative effects that such assessments might
have on learners
Affect (the extent to which the test causes undue anxiety) was added along
with reliability and validity as test qualities to be investigated (1980s)
Washback was also added by the end of 1980s to the list (Washback is effect
of the test on the process of teaching and learning)
By 2000, the role of consequences was also added, i.e., validation should take
into consideration the consequences the test has for those who use it.
This is problematic since it goes beyond the test and its score to its impact
on test users.
The type of consequences that should be investigated and the manner in
which consequences might best be identified remain topics for further
investigation
Probing construct definition
1. The test design-construct connection
Interface design, meaning of the construct under consideration, use of the
test, practicalities of test delivery
The computer-assisted test can allow examinees more or less freedom in how
they access the input, and the question is whether such freedom will help or
hinder in obtaining performance that will reflect the construct.
Two major issues: the inadequate conceptualization of both the listening
comprehension trait (what is to be measured) and the contexts of listening of
interest to the test user (where users listen in life); these form the two ends
of a continuum.
And the middle ground between them is not well understood in language
assessment theory.
decisions have to be made about whether the trait end of the continuum
should be probed, adding to the psycholinguistic detail of the construct, or
whether the contexts perspective is more fruitful.
A psycholinguistic description would give them more information than they
could deal with. An analysis of contexts will give a number of contexts with
many variations.
These facts are a good starting point, but don't offer a direct answer about
construct definition.
The tugging from each end of the continuum is what forces applied linguists
into figuring out how to work within the middle ground
Test design questions being asked today will deliver answers in the near
future.
Answers will be found if research extends beyond efficiency.
2. The test scoring-construct connection
Computational linguists are called in to create programmes that can rate
constructed test responses the way human raters do. But those developers may not
know what the construct definition says is being measured.
Therefore, we need to link scoring to construct definition.
Scoring Dictation
An algorithm that needs to score a written response needs much more input than
one that scores a right/wrong response.
Questions that should be asked include: what makes a response correct? Which
responses are more correct than others? What is the basis of such decisions?
Coniam (1998) attempted to create a more precise scoring method. He noticed
the absence of a clear rationale for partial scores.
Future work attempting to score examinees' dictation responses more accurately
will need to define more precisely the construct that the test is intended to
measure.
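A sketch of what a partial-credit dictation scorer could look like, assuming a construct definition in which word-level accuracy is what matters; the word alignment and the credit scheme are illustrative, not Coniam's procedure:

```python
# Sketch: partial-credit scoring of a dictation response against the target text.
# Word-level alignment via difflib; the credit scheme is illustrative only.
from difflib import SequenceMatcher

def score_dictation(target, response):
    target_words = target.lower().split()
    response_words = response.lower().split()
    matcher = SequenceMatcher(None, target_words, response_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Full credit for each target word reproduced in order; no credit otherwise.
    # A construct that valued meaning over form might instead credit near-misses
    # (e.g., recognisable misspellings) -- exactly the kind of decision that has
    # to come from the construct definition rather than from the algorithm.
    return matched / len(target_words)

print(score_dictation("the cat sat on the mat", "the cat sit on mat"))  # 0.666...
```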
Scoring vocabulary
Some researchers have attempted to score responses to a C-test. A variety of
responses are possible, including ones with errors, and these are indications of
vocabulary knowledge.
The responses can be used to identify aspects of an examinee's answer that are
incomplete or incorrect, which can inform diagnostic scoring.
Partial scoring can be applied to vocabulary also if we define how important
each element of correctness is for the construct definition.
The rationale for the diagnostic or partial score algorithms would have to be
based on the construct definition
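A sketch of a diagnostic/partial scoring rule for a single C-test gap, under a hypothetical construct definition that weights lexical choice more heavily than formal accuracy (the weights and the crude stem-matching heuristic are invented):

```python
# Sketch: diagnostic scoring of one C-test gap, with weights tied to a
# hypothetical construct definition (word choice weighted above formal accuracy).

def score_gap(expected, supplied, choice_weight=0.7, form_weight=0.3):
    """Return (score, diagnosis) for a single C-test gap."""
    expected, supplied = expected.lower(), supplied.lower().strip()
    if supplied == expected:
        return choice_weight + form_weight, "correct"
    # Crude stem match: same first four letters counts as the right lexical choice.
    if supplied[:4] and supplied[:4] == expected[:4]:
        return choice_weight, "right word, wrong form"
    return 0.0, "incorrect or missing"

items = [("understand", "understand"), ("understand", "understood"), ("understand", "know")]
results = [score_gap(expected, supplied) for expected, supplied in items]
total = sum(score for score, _ in results)  # 1.0 + 0.7 + 0.0
```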
The question of how detailed a construct definition should be is tackled by
linking the specificities of a scoring algorithm to the construct definition.
How can a developer decide the level of details of a construct definition? For
this, we need to define the term construct itself.
A construct can be defined as a meaningful interpretation of performance
(this definition links construct to observable performance)
This link offers some grounding to a question: what is language.
How detailed should a construct definition be in order to explain the
performance on a given language test?
To define a construct that can be operationalised, one needs details of
performance; this needs the help of computers, since all human raters can do
is give an overall holistic judgement about performance.
Educational Testing Service has developed e-rater, a system that can evaluate
essays.
Computer assisted scoring should be considered on different grounds than
efficiency oriented research.
Validation
Educational assessments
Can we justify the equivalence of computer-based test and pen-paper test?
Will it be a test of computer proficiency as well as language proficiency?
The threat of the computer as a potential contaminant for a genuine score of
language ability
Questions: What should the computer-assisted language test best be designed
to measure for its intended use? Do test developers really intend to use the
capabilities of new technologies to measure the same language constructs as
the unsophisticated language tests of the past did? In what ways does the
computer-assisted test require us to redefine the very constructs we believe
we are assessing?
One has to look beyond equivalence to find the real validation criteria.
Investigating the Language for Specific Purposes constructs that such tests are
intended to measure requires developers to look beyond the canonical view
of test method as creating a negative effect
Computer-assisted test methods require an even more delicate understanding
of the problem, since the challenges faced by examinees could be different in
case of computer-based tests.
Relevant question to be asked is whether the computer-assisted tests are able
to bring out the abilities that are valid for making the desired inferences
about examinees.
Assessment in second language research
Research intended to justify the validity of inferences is relatively uncommon.
What are the accepted procedures for validation of assessment tasks in L2
research?
Some tasks are the same on computer and on paper, e.g., ticking the right answer.
The problem arises when assessing an examinee's strategy while he/she works on
something else, based on process data.
Hegelheimer and Chapelle suggest recording mouse clicks as a measure of noticing
particular words in a text. Clicking on a word and knowing it on a post-test are
expected to be related.
Does a mouse click indicate noticing? (The usual treatment of noticing doesn't
ask for proof!)
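A minimal sketch of how the click/post-test relation could be examined (the data format is invented; this is not Hegelheimer and Chapelle's actual analysis): log which words each learner clicked during the task, then cross-tabulate clicking against post-test knowledge of the word.

```python
# Sketch: relate clicking on a word during the task to knowing it on the post-test.
# The data structures and words are invented for illustration.

clicks = {"learner1": {"ubiquitous", "salient"}, "learner2": {"salient"}}
posttest = {
    "learner1": {"ubiquitous": True, "salient": False, "tacit": False},
    "learner2": {"ubiquitous": False, "salient": True, "tacit": True},
}

# 2x2 counts: clicked vs. not clicked, known vs. not known on the post-test.
table = {(clicked, known): 0 for clicked in (True, False) for known in (True, False)}
for learner, results in posttest.items():
    for word, known in results.items():
        clicked = word in clicks.get(learner, set())
        table[(clicked, known)] += 1

print(table)  # {(True, True): 2, (True, False): 1, (False, True): 1, (False, False): 2}
```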
The technology-mediated task seems to attract more suspicion and therefore
require proof
What is interesting about novel technology-based assessments is that they seem to
sharpen our critical eye for questioning the justification of the inference.
An expanded set of methods for gathering SLA data raises the question of
what the data measure, which in turn problematizes the issue of what should
be accepted as validity evidence.
Validation and consequences
Researchers and teachers worry that examinees with little or no experience
with computers will suffer from heightened test anxiety when they are asked
to take a computer-based test
This led to studying the consequences of these differences, and of computer-
assisted testing itself.
Data-based analysis of this and other anxiety-related problems should be done.
Validation theory says to include potential positive effects as well.
how computer-mediated language and literacy practices are shaped by
broader institutional and social factors
what these new practices mean from the perspective of the learner
Today, as in 1986, much theoretical and empirical work is needed to bridge
from current technological capabilities to progress in language assessment.
The first step, it appears, is to set aside efficiency as the primary criterion in
CALT research in order to seek solutions to the substantive issues of construct
definition and validation
