SHL White Paper 2010
Assessment in a global context: Best practice in localising and adapting assessments in high stakes scenarios
Eugene Burke and Carly Vaughan
Companies are moving across geographical borders to develop their businesses. The need for equivalent quality of assessments across languages is therefore more important today than ever before. The purpose of this white paper is to share the learning from a four-year programme involving over 700,000 words of content.
The search for talent is increasingly an international effort in the global economy in which we live and work today. For effective assessment of talent, that means an increasing need to ensure that assessments operate effectively and equivalently in the languages in which they will be delivered. Why? First, companies are moving across geographical borders to develop their businesses and to respond to economic challenges and opportunities. Second, even if you operate in only one national market, you are still likely to need to conduct assessments in a variety of languages, as the talent that companies are looking for is also moving across borders in search of employment and development opportunities. The need for equivalent quality of assessments across languages is therefore more important today than ever before.

The purpose of this white paper is to share the learning from a four-year programme involving over 700,000 words of content and an investment of over 2 million by SHL in developing and evaluating processes for localising and adapting ability tests and personality questionnaires. To gauge whether our efforts have been of a sound scientific quality, we have presented our work in well-attended workshops at leading professional events, including the Association of Test Publishers conferences in Dallas, US, in 2008 and Palm Springs, US, in 2009; the European Association of Work and Organizational Psychology conference in Stockholm, Sweden, in 2008; the International Test Commission conference in Liverpool, UK, in 2008; and the European Congress of Psychology in Oslo, Norway, in 2009. The feedback that we have received from both the testing and translation professions has been very positive, and has encouraged us to share our learning through this paper, which we will update as we continue to develop new processes and new insights into the critical issues of effective localisation and adaptation of assessments.

We would like to acknowledge the existing guidelines for test adaptation provided by the International Test Commission (ITC) [1], as well as publications that helped us enormously at the start of our journey four years ago [2]. We would also like to thank Dave Bartram, Ron Hambleton, Robert Roe, John Bateson, Hinrik Johannesson, Dave Wright and the many other colleagues who have helped us formulate the ideas shared in this paper through their suggestions, comments, critiques and questions, as well as helping us to answer some of the difficult questions we have faced on our journey so far. We would also like to thank our translation partners, CLS Communications, for contributing their wealth of experience as well as their patience in taking this journey with us.
In particular, we would like to thank Pablo Navascus and Nickola Kakanskas for their industry and attention to quality in supporting our work. We would also encourage you, the reader, to share your reactions, thoughts and comments on this paper with us, as we freely acknowledge that this is an area of science and practice that needs continuous effort to improve upon, and that our understanding will develop with the insights of others. Please contact either of the authors to share your comments and thoughts. Their emails are eugene.burke@shlgroup.com and carly.vaughan@shlgroup.com.

The importance of localisation to effective multi-language assessment

What is localisation and why should it be important to me?

Consider a scenario in which assessments (such as ability tests or personality questionnaires, though the list could include interviews and observational exercises) are being administered to candidates in several languages. These tests and questionnaires have been designed to reliably distinguish between people in terms of the talents they offer. In this scenario, differences in scores are expected because the assessments have been designed to find those differences. But what if there are patterns in the differences between groups attributable to language or country of origin? The question that then follows is whether those differences in scores are due to true differences in talent (a difference between people) or are simply noise resulting from poor localisation (a difference between the language versions of the assessments).

At the risk of oversimplifying, a score is the result of the interaction between a person and a test, questionnaire or some other form of assessment. The effort that goes into building reliable and valid assessments seeks to reduce the amount of variation in scores due to errors or noise by ensuring that the content is designed to function well psychometrically, that instructions are clear, and that the process for taking the assessment is easy to understand and to follow. However, no assessment is perfect and there will always be a little noise in the data, though, as mentioned, good assessment construction seeks to keep that noise to a minimum [3]. That is generally the case for the development of an assessment in one language. But what happens when we add several language versions to the existing demands of ensuring sound, equivalent psychometric quality irrespective of the language in which the assessment is deployed? The danger in localising an assessment into another language is that the process introduces noise, such that differences in scores reflect a poor localisation rather than true differences between people. Effective localisation processes therefore need to ensure that the assessment measures talent in an equivalent way across different languages, and that any differences between people are meaningful and not just translation noise.
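As a simple numerical illustration of this point, the short sketch below contrasts a source-language version of a test with a hypothetical, poorly localised version in which the translation adds a further source of error. It assumes the classical view of a score as a true score plus independent error, and every figure in it is invented for the example; it is not a model of any SHL assessment.

```python
# Toy illustration of "translation noise": the same true talent scores are
# measured once with ordinary measurement error and once with additional
# error standing in for a poor localisation. All quantities are invented.
import numpy as np

rng = np.random.default_rng(7)
n_people = 50_000

true_talent = rng.normal(0.0, 1.0, n_people)
measurement_error = rng.normal(0.0, 0.4, n_people)   # error present in any assessment
translation_noise = rng.normal(0.0, 0.4, n_people)   # extra error from a poor localisation

score_source = true_talent + measurement_error
score_poor_localisation = true_talent + measurement_error + translation_noise

def true_score_share(observed_scores):
    # Proportion of observed score variance that reflects true differences
    # between people (the classical definition of reliability).
    return np.var(true_talent) / np.var(observed_scores)

print(f"Source-language version:  {true_score_share(score_source):.2f}")
print(f"Poorly localised version: {true_score_share(score_poor_localisation):.2f}")
```

Even in this crude setup, the extra error visibly reduces the share of score variance that reflects true differences between people, which is exactly the loss of equivalence that a localisation process is designed to prevent.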
Isn't this just about translation of assessment content?

You may have noticed that we have used the term translation only once in the past few paragraphs. This is deliberate: to be clear, translation is not localisation but is one necessary element of an effective localisation process. We will explain this later in this paper, but if an assessment has merely been translated then we would argue that this is an insufficient way to mitigate and pre-empt the issues in multi-language assessment outlined above. The reason is simple. Unlike documents or publications, tests, questionnaires and other forms of assessment seek to measure psychological attributes of people, such as their level of reasoning ability, their behavioural style or their motivations, to name but a few examples. These psychological constructs, which we define and evaluate as the attributes measured by an assessment, require careful and diligent processes when being localised into another language, processes that ensure key meanings and interpretations are not changed or distorted.

So, what is a localisation process?

In our view, it is a process that ensures that variations in the accurate measurement of a construct are kept to a minimum across all language versions of an assessment, and one that includes an evaluation of whether any such variation is likely to have a material effect on a person's score and, subsequently, on the likelihood of that person realising an employment goal or opportunity (e.g. whether or not they are selected for a job). In this paper we will focus mainly on ability tests [4], but the principles we describe apply generally to all assessments that are used in the world of work to inform employment decisions. Without clear evidence that a localisation process actively and consciously manages the risks of non-equivalence across language versions, we would argue that the user of multi-language assessments runs the risk of making inconsistent employment decisions. If those inconsistencies are apparent to candidates and to stakeholders within the organisation, they will have consequences for the reputation of the organisation as an employer and for the fairness and perceived value of the assessment process.

An overview of the key steps of the SHL localisation and adaptation process

What are we trying to measure?

Localisation has to start at the beginning, with what the assessment is designed to measure. This may seem a rather obvious statement, but it is one that in our experience is often overlooked. Increasingly, we have developed content not in one language but in several, to test that the constructs we are measuring have the same meaning and functioning across languages and across cultures. It is important to note that the purpose of localisation is not to remove or reduce differences across languages or cultures that may be sensible and meaningful, but to ensure that the content we use to measure those constructs functions equivalently irrespective of the language in which it is administered.

Make sure that those involved in localisation have the right skills and experience.
We require all translations to be undertaken by native speakers who are professionally qualified as translators, who have a minimum of five years' experience of translating from the target language back to the original language of the test or questionnaire, and who are living in the country of the target language. Why do we insist on the last point? Everyone adapts to new countries and cultures after a while, and so we want to ensure that the language used in a localised test or questionnaire reflects the context of the country or countries within which that language is most frequently found. For example, certain terminology used in the UK is not always applicable in other languages; however, a translator who has lived in the UK for some time may have become distanced from the current language most often used in the target language and believe that a UK term is applicable or translatable.

Make sure that everyone understands his or her role and the tasks they are responsible for.

We take all translators and project managers at the translation agency through a structured training programme in which they are taught not only which elements of the content are critical to the meaning of a question, item or scale in the assessment, but also how particular elements of that content are important to the way the assessment functions. For example, for verbal reasoning tests, certain words or phrases will be critical to the questions asked and will contribute to the difficulty of those questions. Variations in the language used can therefore affect the difficulty of a question across different languages. The training is completed by a short test translation, which is reviewed by SHL localisation staff to check that the translator has a good understanding of the task. Translators who do not pass this test translation are replaced by others, who then go through the same training programme and test translation.

Identify localisation risks from the start.

We begin our process with a localisation risk review in which we require the translators to review content for any words, phrases, sentence structures and expressions that could prove problematic in the target language. This enables us to identify problems before the translation stage begins and helps to identify any content that may need to be adapted to function in the target language. Adaptations are undertaken by trained SHL staff in collaboration with the translators.

Check, check and then check again.

We require a minimum of two translators per localisation, with each translator responsible for half the content and with each reviewing the other's work to ensure consistency of terminology. This operates on a similar best-practice principle to that used in software development, where development and checking responsibilities are split between two or more developers to ensure that software is tested in the process of development. This approach also encourages discussion and helps to produce a translation of optimum quality through the sharing of thoughts and ideas. The translation is then subject to a further review by a translator who has not been involved in the original translation.
They too are subject to the same training as described above.

Set common goals for everyone involved.

All reviews at this stage are then combined through a harmonisation process involving a manager from our SHL team, the project manager and the three translators. At this point, any issues are resolved and, where necessary, further adaptations are undertaken. This is an important stage, as it ensures that any changes made to the translations are in line with the key psychometric principles relevant to the assessment.

And then check again.

The translated content is then subjected to another independent review by bilinguals in English (our corporate language at SHL) and the target language. This review is conducted on the test in its final form. For example, if it is an online ability test, the bilingual reviewer will take the test on a computer in the format in which it is intended to be delivered. This not only helps to ensure that the translation is suitable for the mode of deployment, but also puts the test into a clear context. Any issues at this point are then fed back for further work by the translation team. As this and the preceding steps hopefully show, we have engineered our process to place the largest investment in the earliest stages of localisation, long before we collect data on the localised content. Our experience is that it is often too late to rectify problems found at the data collection stage, or that changes at this stage will result in significant costs and delays. Throughout the process outlined above, we have developed a number of lead indicators that enable us to manage the risk of content failing checks at the data collection stage of our localisation process.

Evaluate the localised content.

Once we have final content ready for trial and data collection, we administer this content to large samples of native speakers. For example, we have so far trialled localised versions of Verify [5] reasoning tests with over 9,000 native speakers in Chinese, Dutch, Finnish, French, German, Italian, Norwegian and Swedish, and similar large samples have been used to evaluate the equivalence of different language versions of the Occupational Personality Questionnaire (OPQ) [6]. These evaluations operate as a hard check of measurement equivalence and of whether the preceding steps in the localisation process have been effective. The specifics of the analyses conducted to evaluate the equivalence of assessments across languages vary according to the nature of the constructs being measured. What we look for through these analytical models is evidence of whether a question or a scale is biased against a particular language group. That is, for ability test items as an example, whether the probability of answering a question correctly is higher or lower for a particular language group over and above the ability the test is designed to measure (and there are models for evaluating whether measurement of the ability or construct is equivalent prior to focusing on a specific item or question). The models used to evaluate the equivalence of questions or items include a factor for the ability of the respective language groups, a factor for language (e.g. English or Arabic) and a factor for the interaction between ability and language (this identifies whether we have a form of bias that is specific to particular combinations of ability and language). When the probability of correctly answering an ability question is shown to be associated with language, this is taken as evidence of Differential Item Functioning, or DIF [7].
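To make the shape of these models more concrete, the sketch below shows how such an item-level screen can be set up using the widely used logistic-regression approach to DIF: for each question, a model containing ability only is compared with models that add a language-group term and an ability-by-language interaction, the two comparisons pointing to uniform and non-uniform DIF respectively. This is a simplified illustration rather than our production analysis; the data layout, the assumption of a two-group comparison and the significance threshold are choices made purely for the example.

```python
# Illustrative item-level DIF screen using logistic regression.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

def dif_screen(responses: pd.DataFrame, alpha: float = 0.01) -> pd.DataFrame:
    """Screen items for uniform and non-uniform DIF.

    `responses` is assumed to hold one row per person-item response with:
      item     - item identifier
      correct  - scored response (1/0)
      ability  - matching criterion, e.g. the rest score on the remaining items
      language - language version (two groups: reference vs focal)
    """
    rows = []
    for item, d in responses.groupby("item"):
        base = smf.logit("correct ~ ability", data=d).fit(disp=0)
        uniform = smf.logit("correct ~ ability + language", data=d).fit(disp=0)
        full = smf.logit("correct ~ ability * language", data=d).fit(disp=0)

        # Likelihood-ratio tests between the nested models; each step adds
        # one parameter because only two language groups are compared.
        lr_uniform = 2 * (uniform.llf - base.llf)
        lr_nonuniform = 2 * (full.llf - uniform.llf)
        rows.append({
            "item": item,
            "p_uniform": chi2.sf(lr_uniform, df=1),
            "p_nonuniform": chi2.sf(lr_nonuniform, df=1),
        })

    flags = pd.DataFrame(rows)
    flags["uniform_dif"] = flags["p_uniform"] < alpha
    flags["nonuniform_dif"] = flags["p_nonuniform"] < alpha
    return flags
```

With trial samples running into the thousands per language, significance tests alone will flag trivially small effects, so in practice a screen of this kind would be paired with an effect-size criterion before any item is reviewed or replaced.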
Differences can be OK.

DIF is a fact of life. Attempting to have questions or items that are all entirely free of variation by language or country is, in our view, impractical. The real issue is the extent to which a bank of questions or items contains items that are substantially biased (that show DIF) and the direction of any bias found. Our analysis therefore evaluates the frequency, size and direction of DIF across all questions. Very often, a bank of ability questions will contain some items that favour one group while other questions favour another group. High levels of DIF would show a need to review and possibly replace questions. To date, the results for Verify suggest that between 3% and 6% of the items in a bank show DIF, with those items evenly split in terms of the groups they favour (effectively cancelling out any effect of DIF at the overall score level).

Look at the material impact of any differences due to language.

The final step in our process is to model the impact of differences by language to see whether those differences will have any material effect on outcomes when the assessment is used to inform employment decisions. We will again turn briefly to the case of ability tests. Our models take into account two key aspects of live assessments: the deployment model, or how the content will be administered, and the client's usage model, of which a key aspect is the cut score used to determine outcomes. In the case of Verify, content is deployed using a randomised test model to strengthen test security. In this model, candidates are assigned to different tests containing different questions, constructed in such a way that all candidates receive a test of equivalent difficulty. We can therefore evaluate the extent to which candidates are likely to receive items showing DIF under this deployment model. We can then also look at whether worst-case scenarios, even if unlikely, would result in a candidate having a lower, or indeed higher, probability of passing a cut score depending on the language version that candidate sits. In our evaluations of usage models by language, we have obtained worst-case differences due to language version of an additional 0.18% in errors (i.e. an increase in the probability of a false positive or false negative), or a likelihood of less than 1 in 556 of such an error occurring.
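The sketch below illustrates the logic of such an impact model in its simplest form: scores are treated as approximately normal, a worst-case DIF effect is expressed as a small downward shift in the expected score of one language version, and the question asked is how many decisions around a given cut score would change as a result. The distributions, the size of the shift and the cut score are illustrative assumptions, not our figures or our production method.

```python
# Toy impact model: how much could a worst-case allocation of DIF items move
# selection outcomes around a client's cut score? All values are assumptions.
import numpy as np

rng = np.random.default_rng(42)
n_candidates = 100_000

cut_z = 0.5          # client cut score expressed as a z-score (assumption)
dif_shift_z = 0.05   # worst-case score shift attributable to DIF items (assumption)

true_score = rng.normal(0.0, 1.0, n_candidates)
error = rng.normal(0.0, 0.4, n_candidates)          # ordinary measurement error

observed_reference = true_score + error              # reference-language version
observed_focal = true_score + error - dif_shift_z    # worst-case localised version

pass_reference = observed_reference >= cut_z
pass_focal = observed_focal >= cut_z

# Candidates whose selection outcome changes purely because of the language version
flipped = pass_reference != pass_focal

print(f"Pass rate, reference version:  {pass_reference.mean():.2%}")
print(f"Pass rate, worst-case version: {pass_focal.mean():.2%}")
print(f"Decisions changed by language version: {flipped.mean():.2%}")
```

Varying the cut score and the size of the shift in this sketch shows why acceptability cannot be judged from item statistics alone: the same amount of DIF matters more or less depending on where the cut score sits relative to the score distribution, which is the point made in principle 6 below.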
Keep monitoring the functioning of your assessments.

Our work does not stop there. It continues once a test or questionnaire has been released, with regular checks on the performance of assessments by industry and job type as well as by demographic group, including language.

The ten principles of effective localisation and adaptation of assessments

We offer the following by way of a summary of our journey to date and the learning we have gained thus far. We hope that this summary offers a practical contribution to an essential aspect of assessment in today's global economy: ensuring that assessments offer the same quality and accuracy in identifying a person's talent and potential irrespective of the language in which that potential is assessed.

1. Define the structure of the content to be adapted. Which aspects of assessment content are critical to the functioning of the assessment (fixed constraints that the localisation must operate to), and which aspects are less critical (i.e. those with which translators are allowed more freedom to exercise their own judgement)?

2. Insist that the translation agency and those responsible for QA are trained to understand the structure and functioning of the assessment, and the rules they must work within to maintain the quality of the assessment.

3. Create a clear selection process for the language experts who are to be involved. Insist that they are assessed for their fit to the task, and reserve the right to deselect a translator who does not meet the requirements of the task.

4. Build group work and harmonisation points into your process. Creating a common set of objectives and systematic opportunities for dialogue has paid back more for us than slavishly following translation and back-translation techniques. Encouraging discussion and the sharing of ideas will always result in a better quality translation.

5. Build lead indicators into your process. It is too late if you only start seeing evidence of non-equivalence, such as DIF, once you begin to collect data. DIF should be a measure of how well you have specified and managed the localisation process prior to the point of collecting data across language versions, not a method for identifying problems with the translation.

6. Make sure that your measures of acceptability take into account the use of the test score. Our models are concerned with the differences across languages when outcomes such as selection ratios are taken into account. That is, we evaluate whether the outcomes will be equivalent or whether there will be substantial differences across language groups over and above the talent that the assessment is designed to assess [8]. You should always keep in mind the impact on the assessment user and the candidate.

7. Differences can be OK. The impact depends on the frequency, size and direction of those differences, but that impact needs to be evaluated in relation to how the scores from the assessment are going to be used.
8. Set practical criteria for declaring equivalence. An example of the criteria we use is that the proportion of items showing differences should not exceed a certain percentage, and that the impact around cut scores should not increase beyond a certain tolerance. Knowing what a successful localisation looks like helps define the direction of, and the value from, the localisation effort.

9. Refine your processes from the learning you will gain. We have removed steps as well as added others as we have developed more experience in localisation, and we have identified where we can introduce efficiencies without reducing the effectiveness of the process (i.e. issues we thought were important have since been found to be less critical).

10. Fold the learning from localisation and adaptation back into your assessment development programmes. Learning from adaptation should inform how you create assessments in the future, as you discover what does and does not work in different languages, as well as content to be avoided for either linguistic or cultural reasons (e.g. words for which it is difficult to find equivalents in certain target languages, or stylistic and syntactical elements that are not essential to the functioning of assessment content).

Notes

[1] The current version of the ITC's guidelines for the adaptation of tests can be obtained from http://www.intestcom.org/Guidelines/Adapting+Tests.php, which we accessed on 1 February 2010.

[2] Hambleton, R. K., Merenda, P. F., and Spielberger, C. D. (2005). Adapting educational and psychological tests for cross-cultural assessment. Mahwah, NJ: Lawrence Erlbaum Associates.

[3] The reader might think of a reliability coefficient such as internal consistency, alternate forms or, in the context of observational exercises and interviews, inter-rater agreement or consistency. When looking at language equivalence, the metrics involved are often more complex than a reliability coefficient, but essentially the concern is the same: that language versions operate to an equivalent level of psychometric quality.

[4] This is purely for the sake of brevity.

[5] Details of the Verify Portfolio of Reasoning Tests are available via www.shl.com/OurScience/TechnicalInformation/Pages/TechnicalManualsandGuides.aspx, where a free download of the technical manual is available.

[6] The manual for the OPQ can be downloaded for free from the web address provided in the previous note.

[7] Essentially, there are two types of DIF. Uniform DIF refers to a situation where there is a systematic difference across the ability (or trait) range in favour of a given group. Non-uniform DIF is identified from the interaction between ability and language, such that different levels of ability show different probabilities for different language groups (in the context of localisation). This is the most difficult form of item bias to interpret and to manage. To date, our analyses of ability test items have shown only uniform DIF.

[8] As mentioned earlier, it is possible that differences across groups will be found. The purpose of an effective localisation process is to minimise the influence of language variations as a contributing factor to such differences and, therefore, to maximise the likelihood that such differences reflect true differences in talent and potential in the specific employment context in which the assessment is being used (e.g. the role or job involved, the competencies identified as essential for success, the opportunities available for training and development to achieve the levels of competence required, and the tasks and responsibilities that characterise the job or role concerned).

© SHL Group Limited 2010. All rights reserved. SHL and UCF are registered trademarks of SHL Group Limited. www.shl.com