SHL White Paper 2010
Assessment in a global context: Best practice in localising and adapting assessments in high stakes scenarios
Eugene Burke and Carly Vaughan
Companies are moving across geographical borders to develop their businesses. The need for equivalent quality of assessments across languages is therefore more important today than ever before. The purpose of this white paper is to share the learning from a four-year programme involving over 700,000 words of content.
The search for talent is increasingly an international effort in the global economy in which we live and work today. For effective assessment of talent, that means an increasing need to ensure that assessments operate effectively and equivalently in the languages in which they will be delivered. Why? First, companies are moving across geographical borders to develop their businesses and to respond to economic challenges and opportunities. Second, even if you operate in only one national market, you are still likely to need to conduct assessments in a variety of languages, as the talent that companies are looking for is also moving across borders in search of employment and development opportunities. The need for equivalent quality of assessments across languages is therefore more important today than ever before.

The purpose of this white paper is to share the learning from a four-year programme involving over 700,000 words of content and an investment of over 2 million by SHL in developing and evaluating processes for localising and adapting ability tests and personality questionnaires. To gauge whether our efforts have been of a sound scientific quality, we have presented our work in well-attended workshops at leading professional events, including the Association of Test Publishers conferences in Dallas, US, in 2008 and Palm Springs, US, in 2009; the European Association of Work and Organizational Psychology conference in Stockholm, Sweden, in 2008; the International Test Commission conference in Liverpool, UK, in 2008; and the European Congress of Psychology in Oslo, Norway, in 2009. The feedback that we have received from both the testing and translation professions has been very positive, and has encouraged us to share our learning through this paper, which we will update as we continue to develop new processes and new insights into the critical issues of effective localisation and adaptation of assessments.

We would like to acknowledge the existing guidelines for test adaptation provided by the International Test Commission (ITC) [1], as well as publications that helped us enormously at the start of our journey four years ago [2]. We would also like to thank Dave Bartram, Ron Hambleton, Robert Roe, John Bateson, Hinrik Johannesson, Dave Wright and the many other colleagues who have helped us formulate the ideas shared in this paper through their suggestions, comments, critiques and questions, as well as helping us to answer some of the difficult questions we have faced on our journey so far. We would also like to thank our translation partners, CLS Communications, for contributing their wealth of experience as well as their patience in taking this journey with us.
In particular, we would like to thank Pablo Navascus and Nickola Kakanskas for their industry and attention to quality in supporting our work. We would also encourage you, the reader, to share your reactions, thoughts and comments on this paper with us, as we freely acknowledge that this is an area of science and practice that needs continuous effort to improve upon, and that our understanding will develop with the insights of others. Please contact either of the authors to share your comments and thoughts. Their emails are eugene.burke@shlgroup.com and carly.vaughan@shlgroup.com.

The importance of localisation to effective multi-language assessment

What is localisation and why should it be important to me?

Consider a scenario in which assessments (such as ability tests or personality questionnaires, though the list could include interviews and observational exercises) are being administered to candidates in several languages. These tests and questionnaires have been designed to reliably distinguish between people in terms of the talents they offer. In this scenario, differences in scores are expected because the assessments have been designed to find those differences. But what if there are patterns in the differences between groups attributable to language or country of origin? The question that then follows is whether those differences in scores are due to true differences in talent (a difference between people) or are simply noise resulting from poor localisation (a difference between the language versions of the assessments).

At the risk of oversimplifying, a score is the result of the interaction between a person and a test, questionnaire or some other form of assessment. The effort that goes into building reliable and valid assessments seeks to reduce the amount of variation in scores due to errors or noise by ensuring that the content is designed to function well psychometrically, that instructions are clear, and that the process for taking the assessment is easy to understand and to follow. However, no assessment is perfect and there will always be a little noise in the data, though, as mentioned, good assessment construction seeks to keep that noise to a minimum [3]. That is generally the case for the development of an assessment in one language. But what happens when we add several language versions to the existing demands of ensuring sound, equivalent psychometric quality irrespective of the language in which the assessment is deployed? The danger in localising an assessment into another language is that the process introduces noise, such that differences in scores reflect a poor localisation rather than true differences between people. Effective localisation processes therefore need to ensure that the assessment measures talent in an equivalent way across different languages, and that any differences between people are meaningful and not just translation noise.
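As a simple numerical illustration of this point, the short sketch below contrasts a source-language version of a test with a hypothetical, poorly localised version in which the translation adds a further source of error. It assumes the classical view of a score as a true score plus independent error, and every figure in it is invented for the example; it is not a model of any SHL assessment.

```python
# Toy illustration of "translation noise": the same true talent scores are
# measured once with ordinary measurement error and once with additional
# error standing in for a poor localisation. All quantities are invented.
import numpy as np

rng = np.random.default_rng(7)
n_people = 50_000

true_talent = rng.normal(0.0, 1.0, n_people)
measurement_error = rng.normal(0.0, 0.4, n_people)   # error present in any assessment
translation_noise = rng.normal(0.0, 0.4, n_people)   # extra error from a poor localisation

score_source = true_talent + measurement_error
score_poor_localisation = true_talent + measurement_error + translation_noise

def true_score_share(observed_scores):
    # Proportion of observed score variance that reflects true differences
    # between people (the classical definition of reliability).
    return np.var(true_talent) / np.var(observed_scores)

print(f"Source-language version:  {true_score_share(score_source):.2f}")
print(f"Poorly localised version: {true_score_share(score_poor_localisation):.2f}")
```

Even in this crude setup, the extra error visibly reduces the share of score variance that reflects true differences between people, which is exactly the loss of equivalence that a localisation process is designed to prevent.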
Isn't this just about translation of assessment content?

You may have noticed that we have used the term translation only once in the past few paragraphs. This is deliberate: to be clear, translation is not localisation but is one necessary element of an effective localisation process. We will explain this later in this paper, but if an assessment has merely been translated then we would argue that this is an insufficient way to mitigate and pre-empt the issues in multi-language assessment outlined above. The reason is simple. Unlike documents or publications, tests, questionnaires and other forms of assessment seek to measure psychological attributes of people, such as their level of reasoning ability, their behavioural style or their motivations, to name but a few examples. These psychological constructs, which we define and evaluate as the attributes measured by an assessment, require careful and diligent processes when being localised into another language, processes that ensure key meanings and interpretations are not changed or distorted.

So, what is a localisation process?

In our view, it is a process that ensures that variations in the accurate measurement of a construct are kept to a minimum across all language versions of an assessment, and one that includes an evaluation of whether any such variation is likely to have a material effect on a person's score and, subsequently, on the likelihood of that person realising an employment goal or opportunity (e.g. whether or not they are selected for a job). In this paper we will focus mainly on ability tests [4], but the principles we describe apply generally to all assessments that are used in the world of work to inform employment decisions. Without clear evidence that a localisation process actively and consciously manages the risks of non-equivalence across language versions, we would argue that the user of multi-language assessments runs the risk of making inconsistent employment decisions. If those inconsistencies are apparent to candidates and to stakeholders within the organisation, they will have consequences for the reputation of the organisation as an employer and for the fairness and perceived value of the assessment process.

An overview of the key steps of the SHL localisation and adaptation process

What are we trying to measure?

Localisation has to start at the beginning, with what the assessment is designed to measure. This may seem a rather obvious statement, but it is one that in our experience is often overlooked. Increasingly, we have developed content not in one language but in several, to test that the constructs we are measuring have the same meaning and functioning across languages and across cultures. It is important to note that the purpose of localisation is not to remove or reduce differences across languages or cultures that may be sensible and meaningful, but to ensure that the content we use to measure those constructs functions equivalently irrespective of the language in which it is administered.

Make sure that those involved in localisation have the right skills and experience.
We require all translations to be undertaken by native speakers who are professionally qualified as translators, who have a minimum of five years' experience of translating from the target language back to the original language of the test or questionnaire, and who are living in the country of the target language. Why do we insist on the last point? Everyone adapts to new countries and cultures after a while, and so we want to ensure that the language used in a localised test or questionnaire reflects the context of the country or countries within which that language is most frequently found. For example, certain terminology used in the UK is not always applicable in other languages; however, a translator who has lived in the UK for some time may have become distanced from the current language most often used in the target language and believe that a UK term is applicable or translatable.

Make sure that everyone understands his or her role and the tasks they are responsible for.

We take all translators and project managers at the translation agency through a structured training programme in which they are taught not only which elements of the content are critical to the meaning of a question, item or scale in the assessment, but also how particular elements of that content are important to the way the assessment functions. For example, for verbal reasoning tests, certain words or phrases will be critical to the questions asked and will contribute to the difficulty of those questions. Variations in the language used can therefore affect the difficulty of a question across different languages. The training is completed by a short test translation, which is reviewed by SHL localisation staff to check that the translator has a good understanding of the task. Translators who do not pass this test translation are replaced by others, who then go through the same training programme and test translation.

Identify localisation risks from the start.

We begin our process with a localisation risk review in which we require the translators to review content for any words, phrases, sentence structures and expressions that could prove problematic in the target language. This enables us to identify problems before the translation stage begins and helps to identify any content that may need to be adapted to function in the target language. Adaptations are undertaken by trained SHL staff in collaboration with the translators.

Check, check and then check again.

We require a minimum of two translators per localisation, with each translator responsible for half the content and with each reviewing the other's work to ensure consistency of terminology. This operates on a similar best-practice principle to that used in software development, where development and checking responsibilities are split between two or more developers to ensure that software is tested in the process of development. This approach also encourages discussion and helps to produce a translation of optimum quality through the sharing of thoughts and ideas. The translation is then subject to a further review by a translator who has not been involved in the original translation.
They too are subject to the same training as described above.

Set common goals for everyone involved.

All reviews at this stage are then combined through a harmonisation process involving a manager from our SHL team, the project manager and the three translators. At this point, any issues are resolved and, where necessary, further adaptations are undertaken. This is an important stage, as it ensures that any changes made to the translations are in line with the key psychometric principles relevant to the assessment.

And then check again.

The translated content is then subjected to another independent review by bilinguals in English (our corporate language at SHL) and the target language. This review is conducted on the test in its final form. For example, if it is an online ability test, the bilingual reviewer will take the test on a computer in the format in which it is intended to be delivered. This not only helps to ensure that the translation is suitable for the mode of deployment, but also puts the test into a clear context. Any issues at this point are then fed back for further work by the translation team. As this and the preceding steps hopefully show, we have engineered our process to place the largest investment in the earliest stages of localisation, long before we collect data on the localised content. Our experience is that it is often too late to rectify problems found at the data collection stage, or that changes at this stage will result in significant costs and delays. Throughout the process outlined above, we have developed a number of lead indicators that enable us to manage the risk of content failing checks at the data collection stage of our localisation process.

Evaluate the localised content.

Once we have final content ready for trial and data collection, we administer this content to large samples of native speakers. For example, we have so far trialled localised versions of Verify [5] reasoning tests with over 9,000 native speakers in Chinese, Dutch, Finnish, French, German, Italian, Norwegian and Swedish, and similar large samples have been used to evaluate the equivalence of different language versions of the Occupational Personality Questionnaire (OPQ) [6]. These evaluations operate as a hard check of measurement equivalence and of whether the preceding steps in the localisation process have been effective. The specifics of the analyses conducted to evaluate the equivalence of assessments across languages vary according to the nature of the constructs being measured. What we look for through these analytical models is evidence of whether a question or a scale is biased against a particular language group. That is, for ability test items as an example, whether the probability of answering a question correctly is higher or lower for a particular language group over and above the ability the test is designed to measure (and there are models for evaluating whether measurement of the ability or construct is equivalent prior to focusing on a specific item or question). The models used to evaluate the equivalence of questions or items include a factor for the ability of the respective language groups, a factor for language (e.g. English or Arabic) and a factor for the interaction between ability and language (this identifies whether we have a form of bias that is specific to particular combinations of ability and language). When the probability of correctly answering an ability question is shown to be associated with language, this is taken as evidence of Differential Item Functioning, or DIF [7].
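To make the shape of these models more concrete, the sketch below shows how such an item-level screen can be set up using the widely used logistic-regression approach to DIF: for each question, a model containing ability only is compared with models that add a language-group term and an ability-by-language interaction, the two comparisons pointing to uniform and non-uniform DIF respectively. This is a simplified illustration rather than our production analysis; the data layout, the assumption of a two-group comparison and the significance threshold are choices made purely for the example.

```python
# Illustrative item-level DIF screen using logistic regression.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

def dif_screen(responses: pd.DataFrame, alpha: float = 0.01) -> pd.DataFrame:
    """Screen items for uniform and non-uniform DIF.

    `responses` is assumed to hold one row per person-item response with:
      item     - item identifier
      correct  - scored response (1/0)
      ability  - matching criterion, e.g. the rest score on the remaining items
      language - language version (two groups: reference vs focal)
    """
    rows = []
    for item, d in responses.groupby("item"):
        base = smf.logit("correct ~ ability", data=d).fit(disp=0)
        uniform = smf.logit("correct ~ ability + language", data=d).fit(disp=0)
        full = smf.logit("correct ~ ability * language", data=d).fit(disp=0)

        # Likelihood-ratio tests between the nested models; each step adds
        # one parameter because only two language groups are compared.
        lr_uniform = 2 * (uniform.llf - base.llf)
        lr_nonuniform = 2 * (full.llf - uniform.llf)
        rows.append({
            "item": item,
            "p_uniform": chi2.sf(lr_uniform, df=1),
            "p_nonuniform": chi2.sf(lr_nonuniform, df=1),
        })

    flags = pd.DataFrame(rows)
    flags["uniform_dif"] = flags["p_uniform"] < alpha
    flags["nonuniform_dif"] = flags["p_nonuniform"] < alpha
    return flags
```

With trial samples running into the thousands per language, significance tests alone will flag trivially small effects, so in practice a screen of this kind would be paired with an effect-size criterion before any item is reviewed or replaced.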
Differences can be OK.

DIF is a fact of life. Attempting to have questions or items that are all entirely free of variation by language or country is, in our view, impractical. The real issue is the extent to which a bank of questions or items contains items that are substantially biased (that show DIF) and the direction of any bias found. Our analysis therefore evaluates the frequency, size and direction of DIF across all questions. Very often, a bank of ability questions will contain some items that favour one group while other questions favour another group. High levels of DIF would show a need to review and possibly replace questions. To date, the results for Verify suggest that between 3% and 6% of the items in a bank show DIF, with those items evenly split in terms of the groups they favour (effectively cancelling out any effect of DIF at the overall score level).

Look at the material impact of any differences due to language.

The final step in our process is to model the impact of differences by language to see whether those differences will have any material effect on outcomes when the assessment is used to inform employment decisions. We will again turn briefly to the case of ability tests. Our models take into account two key aspects of live assessments: the deployment model, or how the content will be administered, and the client's usage model, of which a key aspect is the cut score used to determine outcomes. In the case of Verify, content is deployed using a randomised test model to strengthen test security. In this model, candidates are assigned to different tests containing different questions, constructed in such a way that all candidates receive a test of equivalent difficulty. We can therefore evaluate the extent to which candidates are likely to receive items showing DIF under this deployment model. We can then also look at whether worst-case scenarios, even if unlikely, would result in a candidate having a lower, or indeed higher, probability of passing a cut score depending on the language version that candidate sits. In our evaluations of usage models by language, we have obtained worst-case differences due to language version of an additional 0.18% in errors (i.e. an increase in the probability of a false positive or false negative), or a likelihood of less than 1 in 556 of such an error occurring.
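The sketch below illustrates the logic of such an impact model in its simplest form: scores are treated as approximately normal, a worst-case DIF effect is expressed as a small downward shift in the expected score of one language version, and the question asked is how many decisions around a given cut score would change as a result. The distributions, the size of the shift and the cut score are illustrative assumptions, not our figures or our production method.

```python
# Toy impact model: how much could a worst-case allocation of DIF items move
# selection outcomes around a client's cut score? All values are assumptions.
import numpy as np

rng = np.random.default_rng(42)
n_candidates = 100_000

cut_z = 0.5          # client cut score expressed as a z-score (assumption)
dif_shift_z = 0.05   # worst-case score shift attributable to DIF items (assumption)

true_score = rng.normal(0.0, 1.0, n_candidates)
error = rng.normal(0.0, 0.4, n_candidates)          # ordinary measurement error

observed_reference = true_score + error              # reference-language version
observed_focal = true_score + error - dif_shift_z    # worst-case localised version

pass_reference = observed_reference >= cut_z
pass_focal = observed_focal >= cut_z

# Candidates whose selection outcome changes purely because of the language version
flipped = pass_reference != pass_focal

print(f"Pass rate, reference version:  {pass_reference.mean():.2%}")
print(f"Pass rate, worst-case version: {pass_focal.mean():.2%}")
print(f"Decisions changed by language version: {flipped.mean():.2%}")
```

Varying the cut score and the size of the shift in this sketch shows why acceptability cannot be judged from item statistics alone: the same amount of DIF matters more or less depending on where the cut score sits relative to the score distribution, which is the point made in principle 6 below.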
Keep monitoring the functioning of your assessments.

Our work does not stop there. It continues once a test or questionnaire has been released, with regular checks on the performance of assessments by industry and job type as well as by demographic group, including language.

The ten principles of effective localisation and adaptation of assessments

We offer the following by way of a summary of our journey to date and the learning we have gained thus far. We hope that this summary offers a practical contribution to an essential aspect of assessment in today's global economy: ensuring that assessments offer the same quality and accuracy in identifying a person's talent and potential irrespective of the language in which that potential is assessed.

1. Define the structure of the content to be adapted. Which aspects of assessment content are critical to the functioning of the assessment (fixed constraints that the localisation must operate to), and which aspects are less critical (i.e. those with which translators are allowed more freedom to exercise their own judgement)?

2. Insist that the translation agency and those responsible for QA are trained to understand the structure and functioning of the assessment, and the rules they must work within to maintain the quality of the assessment.

3. Create a clear selection process for the language experts who are to be involved. Insist that they are assessed for their fit to the task, and reserve the right to deselect a translator who does not meet the requirements of the task.

4. Build group work and harmonisation points into your process. Creating a common set of objectives and systematic opportunities for dialogue has paid back more for us than slavishly following translation and back-translation techniques. Encouraging discussion and the sharing of ideas will always result in a better quality translation.

5. Build lead indicators into your process. It is too late if you only start seeing evidence of non-equivalence, such as DIF, once you begin to collect data. DIF should be a measure of how well you have specified and managed the localisation process prior to the point of collecting data across language versions, not a method for identifying problems with the translation.

6. Make sure that your measures of acceptability take into account the use of the test score. Our models are concerned with the differences across languages when outcomes such as selection ratios are taken into account. That is, we evaluate whether the outcomes will be equivalent or whether there will be substantial differences across language groups over and above the talent that the assessment is designed to assess [8]. You should always keep in mind the impact on the assessment user and the candidate.

7. Differences can be OK. The impact depends on the frequency, size and direction of those differences, but that impact needs to be evaluated in relation to how the scores from the assessment are going to be used.
8. Set practical criteria for declaring equivalence. An example of the criteria we use is that the proportion of items showing differences should not exceed a certain percentage, and that the impact around cut scores should not increase beyond a certain tolerance. Knowing what a successful localisation looks like helps define the direction of, and the value from, the localisation effort.

9. Refine your processes from the learning you will gain. We have removed steps as well as added others as we have developed more experience in localisation, and we have identified where we can introduce efficiencies without reducing the effectiveness of the process (i.e. issues we thought were important have since been found to be less critical).

10. Fold the learning from localisation and adaptation back into your assessment development programmes. Learning from adaptation should inform how you create assessments in the future, as you discover what does and does not work in different languages, as well as content to be avoided for either linguistic or cultural reasons (e.g. words for which it is difficult to find equivalents in certain target languages, or stylistic and syntactical elements that are not essential to the functioning of assessment content).

Notes

[1] The current version of the ITC's guidelines for the adaptation of tests can be obtained from http://www.intestcom.org/Guidelines/Adapting+Tests.php, which we accessed on 1 February 2010.

[2] Hambleton, R. K., Merenda, P. F., and Spielberger, C. D. (2005). Adapting educational and psychological tests for cross-cultural assessment. Mahwah, NJ: Lawrence Erlbaum Associates.

[3] The reader might think of a reliability coefficient such as internal consistency, alternate forms or, in the context of observational exercises and interviews, inter-rater agreement or consistency. When looking at language equivalence, the metrics involved are often more complex than a reliability coefficient, but essentially the concern is the same: that language versions operate to an equivalent level of psychometric quality.

[4] This is purely for the sake of brevity.

[5] Details of the Verify Portfolio of Reasoning Tests are available via www.shl.com/OurScience/TechnicalInformation/Pages/TechnicalManualsandGuides.aspx, where a free download of the technical manual is available.

[6] The manual for the OPQ can be downloaded for free from the web address provided in the previous note.

[7] Essentially, there are two types of DIF. Uniform DIF refers to a situation where there is a systematic difference across the ability (or trait) range in favour of a given group. Non-uniform DIF is identified from the interaction between ability and language, such that different levels of ability show different probabilities for different language groups (in the context of localisation). This is the most difficult form of item bias to interpret and to manage. To date, our analyses of ability test items have shown only uniform DIF.

[8] As mentioned earlier, it is possible that differences across groups will be found. The purpose of an effective localisation process is to minimise the influence of language variations as a contributing factor to such differences and, therefore, to maximise the likelihood that such differences reflect true differences in talent and potential in the specific employment context in which the assessment is being used (e.g. the role or job involved, the competencies identified as essential for success, the opportunities available for training and development to achieve the levels of competence required, and the tasks and responsibilities that characterise the job or role concerned).

© SHL Group Limited 2010. All rights reserved. SHL and UCF are registered trademarks of SHL Group Limited. www.shl.com