
Computers & Education 130 (2019) 139–151

Contents lists available at ScienceDirect

Computers & Education


journal homepage: www.elsevier.com/locate/compedu

Cheat-resistant multiple-choice examinations using personalization


Sathiamoorthy Manoharan
Department of Computer Science, University of Auckland, New Zealand

ARTICLE INFO

Keywords:
Academic dishonesty
Cheat-resistant assessment
Learning environment
Student experience
Student assessment

ABSTRACT

Multiple-choice examinations offer the ability to grade quickly as well as the ability to assess concepts and understanding in a wide range of subjects. Consequently, many large classes use multiple-choice examinations. One problem, however, is that multiple-choice examinations are more prone to cheating than constructed-response style examinations. Multiple-choice examinations offer limited answer options, and these limited options can lead to sharing answers through collusion or gleaning answers from unwitting peers. To counter such cheating, this paper investigates a personalization approach to examinations whereby every student gets their own version of the examination that is different from those of their peers. Such a personalization approach not only counters cheating, but also encourages students to focus on concepts rather than just answers. A software framework that facilitates generating personalized examination papers is developed, and the paper reports on the experience of using the approach in large classes. It discusses the administrative, technical, and pedagogical challenges posed by personalization and how these challenges might be overcome using the framework as well as accompanying processes. Surveys indicate that both students and staff are positive about using such a system.

1. Introduction

Multiple-choice examinations are widely used in large classes to make scoring manageable and to enable measurement of knowledge and competencies at the same time (Steven & Downing, 2006). To be an effective assessment tool, the questions of a multiple-choice examination need to be well designed, taking into account the respective levels of Bloom's taxonomy (Steven & Downing, 2006; Young & Shawl, 2013).
One of the major drawbacks of multiple-choice examinations, even if well designed, is that they are prone to cheating. While examiners put measures in place to make direct copying difficult, for example, by having multiple versions of the examination script where the answers and/or questions are shuffled, strategies that circumvent these measures have been observed. These strategies include some form of coded communication between a group of participating students. For example, Fig. 1 illustrates an observed strategy of writing a code that uniquely identifies an answer in big letters so that students sitting close enough can collude by observing what the others have written down. The answer option (A–E) may differ between students, but the absolute answer remains the same across them. Consequently, the strategy works well even with multiple versions of the examination script.
Such collusion has been repeatedly observed at our university, and the research work reported herein is a result of trying to mitigate such collusion. We use personalization to render any blind copying in the examination ineffective. While personalization has been used in other contexts such as targeted instruction and adaptive learning (Weld et al., 2012; Zare, 2011) and even plagiarism mitigation (Manoharan, 2017), personalization in the context of examinations, in particular multiple-choice examinations, poses its own challenges, and it is those challenges that make this research interesting. These challenges fall under three categories: technical, pedagogical, and administrative.

E-mail address: mano@cs.auckland.ac.nz.

https://doi.org/10.1016/j.compedu.2018.11.007
Received 3 March 2018; Received in revised form 17 November 2018; Accepted 19 November 2018
Available online 22 November 2018
0360-1315/ © 2018 Elsevier Ltd. All rights reserved.

Fig. 1. Collusion by scribing answer codes in big letters.



The main technical challenge is to construct a generic framework that could help automate the process of generating personalized examination scripts as well as scoring students' responses. The main pedagogical challenge is to ensure fairness across the examination scripts. A typical personalized quiz or assessment would only require the system to produce the correct answer. In a personalized multiple-choice examination, however, each question needs to be provided with not only the correct answer but also a sufficient number of incorrect answer options, i.e., distractors.
This paper addresses the following two related research questions.

1. Would it be possible to architect a generic framework to enable personalized multiple-choice examinations?


2. What are the technical, pedagogical, and administrative challenges posed by personalization, and how might these be overcome?

Personalized assignments have been widely studied and used in a number of other works (Kashy et al., 1993; Kumar, 2013; Manoharan, 2017; Smaill, 2005). Therefore, both of these research questions have already been partially answered. The main contribution of this paper is to apply similar concepts in the context of multiple-choice examinations and to discuss the challenges specific to these examinations.
The rest of this paper is organized as follows. Section 2 discusses the broad issues of cheating in examinations and personalized
assessment, both in the light of related work. Section 3 discusses the challenges in developing personalized examination scripts.
Section 4 discusses the architecture of a software framework that is able to generate and score personalized examinations. Section 5
evaluates the framework in the light of its ability to reduce cheating. The evaluation includes results from student surveys as well as a
staff survey. Section 6 shares the experience of using the system, and its advantages and disadvantages. The final section summarizes
and concludes the paper.

2. Background and related work

Zobel reported on a long-standing and well-organized cheating racket in 2004 (Zobel, 2004). The so-called ‘my tutor’ was used not only to contract-cheat in assignments but also to ghost-write in examinations. The case highlighted the prevalence of cheating as well as the effort required to investigate cheating incidents and to bring those students and contractors to justice.
Nearly fifteen years on, and with almost all academic institutions having in place compulsory tuition on academic honesty, has
cheating reduced? This is not an easy question to answer, but our observation is that academic institutions still come across more
cases of cheating than they would like to. It is estimated that about 70% of students admit to some cheating (Broeckelman-Post,
2008).
Contract cheating has become a lot more prevalent (Clarke & Lancaster, 2006; Walker & Townley, 2012). While Zobel's ‘my tutor’ case advertised on local notice boards, modern contractors advertise online, including on social media. There are a large number of online sites offering a variety of contract services, ranging from simple assignment solutions to writing complete doctoral theses. These sites even guarantee that their work is “plagiarism free”. The prevalence of contract cheating has prompted national education bodies to develop policies that aim to mitigate such cheating (The Quality Assurance Agency for Higher Education, 2017). One such policy stipulates that students visiting “cheating” sites from within academic institutions be redirected to a site that promotes academic honesty (The Quality Assurance Agency for Higher Education, 2017).
Advances in technology have made cheating detection easier, but the same technology has also made cheating easier. Students use not only the time-tried cheat-sheets and invisible inks, but also miniature cameras and communication devices. To counter the potential use of these devices, some examination centres in China jam wireless signals, blocking any potential over-the-air cheating.
While contract ghost-writing and the use of cheating devices in examinations seem glamorous, the most commonly employed forms of cheating that we have observed are the following three:


1. Premeditated collusion by writing answers in big letters


2. Discreetly observing the answers of surrounding (potentially unwitting) students, often known as wandering eyes
3. Using a cheat-sheet (though some examinations render cheat-sheets irrelevant through open-book policies or by providing an official cheat-sheet containing what the students need to look up).

Through an extensive student survey, Shon classified the common cheating strategies students employ (Shon, 2006). The classification includes collusion, gleaning answers from unwitting peers, the use of technology, and taking advantage of the behavioural and/or psychological profiles of invigilators.
Cheating in multiple-choice examinations is somewhat easier than in examinations where questions use free-format answers, or constructed responses. This is because the information that needs to be gleaned from other students, willing or unwitting, is much smaller in a multiple-choice examination. In the most trivial case, it would simply be an answer option (such as one of A–E).
Consequently, most universities shuffle answer options and/or questions to make copying difficult. However, as illustrated in Fig. 1, shuffling answer options might not deter students determined to cheat. Randomized seating would mitigate premeditated collusion, but in many universities exam rooms have no seat numbers to facilitate randomized seating. While shuffling the answer options can be easily automated, shuffling questions generally needs to be a manual process – this is because the ordering of questions would typically follow a topic order. Therefore, many examinations opt to shuffle only the answer options.
In most seating arrangements, students are able to observe the answers chosen by surrounding peers because of the lack of space. In addition, where examinations are conducted in lecture halls, which are typically sloped towards the podium, it is easy to note the answers of students sitting further forward.
Where there are multiple versions of the examination scripts, the scripts are distributed in such a way that no student sits next to
another student with the same version. To help this distribution, it is common practice to colour code the scripts. However, the colour
coding also enables a dishonest student to seek answers from a student with a matching script.
Would it be possible to quantitatively assess whether there was cheating in a multiple-choice examination? Marx and Longer argued that it would not be (Marx & Longer, 1986), but more recent work deals with statistical analysis of student answers to detect potential collusion (Wesolowsky, 2000; Ercole, Whittlestone, Melvin, & Rashbass, 2002; Richmond & Roehner, 2015; D'Souza & Siegfeldt, 2017). The legal aspects of how any such detected collusion could be treated are unclear unless there is other physical evidence to substantiate the statistical measures.
Dealing with suspected cases of cheating is a time-consuming exercise. Physical evidence and witnesses need to be obtained and
retained through the formal process of trying the students involved. In addition, statistical evidence may also be generated to
supplement the physical one.
The approach this paper takes is to remove the opportunity to cheat through collusion and discreet observation. To this end, the examination scripts are personalized so that no two students get the same questions. The concept of personalization in assessment is not new. It has been used in the context of targeted instruction and adaptive learning (Weld et al., 2012; Zare, 2011) as well as in the context of mitigating plagiarism (Manoharan, 2017).
Typically, personalized assessments use three approaches.

1. Parameterization – substitute parameters with appropriate random values.


2. Databank – randomly choose items from a large databank of questions.
3. Macro – replace macros (a program fragment) in questions with the result of executing the macros.

Problets, which generates short computer-programming questions, takes the parameterization approach to generate fragments of computer programs which the students are asked to analyze (Kumar, 2013). Abaligeti and Kehl take a similar parameterization approach to individualize examination scripts (Abaligeti & Kehl, 2018). OASIS, a personalized engineering quiz system, takes the databank approach to select questions from a database, but it also substitutes parameters in the selected questions (Smaill, 2005). The personalized assignments discussed by Manoharan use a macro approach and therefore achieve a level of complexity and customization that is not possible via parameterization alone (Manoharan, 2017). It is the macro approach that this paper takes to personalize multiple-choice examinations, for this approach gives the instructor a lot of freedom and flexibility to formulate questions. The downside, however, is that the instructor needs basic programming skills to write the macros.

3. Challenges

While it appears that personalized multiple-choice examinations can overcome the main strategies of cheating, crafting personalized questions poses a number of challenges. These challenges can be broadly classified into administrative, technical, and pedagogical challenges.
This section discusses these challenges and how they can possibly be addressed.

3.1. Administrative challenges

Multiple-choice questions are typically answered on an optical answer sheet, also known as a “bubble sheet” or “Scantron sheet” (named after Scantron Corporation, the well-known company that commercializes optical answer sheets).
The Scantron sheet our university uses has fields for student name, a numeric student ID, an alphanumeric course ID, and a


numeric version ID (which distinguishes multiple versions of the same examination), and fields to bubble in the chosen answer
options.

3.1.1. Examination delivery


When the examination scripts are auto-generated, it is logical to consider digital delivery options. However, large class sizes (e.g.,
hundreds of students) and rules surrounding examination administration might make a digital examination unsuitable. Therefore,
personalized examination scripts would need to be printed and distributed to students just like standard examination scripts.

3.1.2. Student and script identification


Personalization requires uniquely identifying each student and their examination script. Given that each student already has an ID, the first thought was to use this to generate a personalized examination script. However, with large classes it would take a long time to physically distribute the scripts, since we would need to hand each student their specific printed script.
To save distribution time, we would like any student to pick up any script and note down a script ID on their Scantron sheet. The
student ID and the script ID will then map a script to a student. The script ID is essentially the examination version number, and we
use as many versions as there are students.
The scanning centre at the university supplies Scantron sheets, and they always pre-print four version numbers on colour-coded Scantron sheets. They will not supply Scantron sheets with blank version numbers. While using the examination version number on the Scantron sheet would be the best solution for script identification, it is a solution that cannot be used because of this administrative restriction.
The next best solution is to encode the script ID in the answers to the first few questions. We use five answer options per question, and therefore each question can encode five values. Using the first n questions to encode the script ID therefore gives us 5ⁿ different versions. If the class size is m, given that we need as many versions as the class size, we need to use ⌈log₅ m⌉ questions to encode the script ID.
The script IDs include check digits standardized by ISO/IEC 7064:2003(E) so that incorrect ID entries can be identified.
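As a concrete illustration of this encoding, here is a minimal sketch in Python (illustrative only; the framework's actual implementation and the specific ISO/IEC 7064 check-digit variant are not given in the paper, so the check digit is merely indicated):

```python
import math

LETTERS = "ABCDE"  # five answer options per question, so each question encodes one base-5 digit

def questions_needed(class_size: int) -> int:
    """Number of leading questions needed so that every student can have a unique script ID."""
    return max(1, math.ceil(math.log(class_size, 5)))

def encode_script_id(script_id: int, num_questions: int) -> str:
    """Express the script ID in base 5, most significant digit first, as the
    'answers' to the first num_questions questions. A check digit per
    ISO/IEC 7064 would be appended in practice; it is omitted in this sketch."""
    digits = []
    for _ in range(num_questions):
        digits.append(LETTERS[script_id % 5])
        script_id //= 5
    return "".join(reversed(digits))

n = questions_needed(420)                 # a class of 420 students needs 4 encoding questions
print(n, encode_script_id(137, n))        # prints: 4 BACC
```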

3.1.3. Scoring and script identification


Scoring software that marks a completed Scantron sheet requires the version number and the correct answers corresponding to that version number. Given that the real version number is encoded in the answers to the first few questions, we need a preprocessing facility that replaces the pre-printed dummy version numbers (1–4) with the script IDs. With this replacement, the scoring software can transparently mark the prepared Scantron scan results.
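A sketch of such a preprocessing step is shown below (Python). The column names Q1…Q4 and Version and the CSV layout are assumptions made for illustration; the actual scan-result format used by the scanning centre is not described in the paper.

```python
import csv

ENCODING_QUESTIONS = 4                     # assumed: first four questions encode the script ID
VALUE = {letter: i for i, letter in enumerate("ABCDE")}

def decode_script_id(answers: list[str]) -> int:
    """Recover the numeric script ID from the answers to the encoding questions."""
    value = 0
    for letter in answers:
        value = value * 5 + VALUE[letter]
    return value

def preprocess(scan_in: str, scan_out: str) -> None:
    """Replace the pre-printed dummy version number (1-4) with the decoded script ID
    so that the standard scoring software can look up the matching answer key."""
    with open(scan_in, newline="") as fin, open(scan_out, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            answers = [row[f"Q{i}"] for i in range(1, ENCODING_QUESTIONS + 1)]
            row["Version"] = str(decode_script_id(answers))
            writer.writerow(row)
```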

3.2. Technical challenges

Multiple-choice questions belong to a generic class of questions commonly known as selected-response questions, where the student chooses one or more of the supplied answer options. This is in contrast to constructed-response questions, where the students write free-format answers (such as short answers or essays).
Selected-response questions come in several types (Haladyna, Downing, & Rodriguez, 2002).

1. Multiple-choice single-response. Most examinations use this type, where the student chooses just one of the available answer options. In this case, only one of the options is the correct answer and the others are distractors.
2. Multiple-choice multiple-response. This applies when there is more than one correct answer option, and the student is expected to select them all.
3. Complex multiple-choice or XYZ questions. These are multiple-choice single-response questions but with a secondary level. The sample question in Fig. 1 belongs to this class. More than one, or none, of the X, Y, Z options could be correct.
4. Group. This is technically not a question, but puts together a number of related questions, often with a common context. A student answering the group of questions will first need to understand the common context, which may supply information that applies to all the questions in the group.
5. True/false or dichotomy. In this case, there are two answer options available and one of them is correct. This is technically a subset of the multiple-choice single-response class. Three related dichotomy questions can be combined into an XYZ question so as to have more than two answer options. True/false questions can also be extended to have more than two answer options, while still ensuring that only one of the options is correct.

3.2.1. Supporting multiple types of selected-response questions


The main technical challenge is to construct a generic framework that supports these different types of selected-response questions. Of these, XYZ questions deserve special mention.
Writing plausible distractors is a time-consuming task, and therefore some suggest having a small number (typically two) of plausible distractors instead of a larger number of lower-quality distractors (Gierl, Bulut, Guo, & Zhang, 2017). However, this has the negative effect that a student can guess the answer with a high probability (e.g., 33%) of getting it right. An XYZ question allows one to have a small number of plausible distractors while making it possible to lower the probability of guessing the correct answer – this is because there are eight possible answer options to choose from (see Fig. 2). One of the requirements, therefore, is that the framework should make it easy to generate XYZ questions.


Fig. 2. Truth table for XYZ questions.

3.2.2. Duplicate answer options


None of the answer options should be duplicated. While this seems obvious, when programmatically generating answer options it is possible to get duplicate distractors or even to get the correct answer as one of the distractors. The framework should reject questions that have duplicate answer options.
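A minimal check of this kind might look like the following sketch (Python; the function name and the normalization rule are illustrative, not the framework's actual API):

```python
def validate_options(options: list[str]) -> None:
    """Reject a generated question whose answer options contain duplicates.
    Options are compared after normalizing whitespace and case so that
    near-identical strings are also caught."""
    normalized = [" ".join(opt.split()).lower() for opt in options]
    if len(set(normalized)) != len(normalized):
        raise ValueError(f"duplicate answer options: {options}")
```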

3.2.3. Registering correct answers


There needs to be a facility to easily register the correct answer option for each generated question. These expected answer
options can then be used for scoring student responses.

3.2.4. Accounting for errors


While due care is usually taken in ensuring the correctness of the questions and answer options, sometimes mistakes do occur.
This may require us to adjust the scoring. There are three possible cases: (1) a distractor may prove to be the correct answer, (2) there
may be more than one correct answer, or (3) there may not be a correct answer. The answer generation facility therefore should be
flexible enough to take these cases into account post-examination.
In addition, students are required to keep the examination script until the marking is complete, so that if they make a mistake in entering the script ID on the Scantron sheet, the error can be corrected post-examination.

3.3. Pedagogical challenges

3.3.1. Ensuring fairness


A personalized examination gives every student their own questions. It is therefore important to ensure that any question, across the different versions of the examination paper, maintains the same level of difficulty and tests the same learning outcome. For instance, the following two examples do not meet these requirements.

1. Consider the parameterized question “What is the sum of $a and $b?” where $a and $b are parameterized integer values. This
question could lead to generating “What is the sum of 23 and 48?” and “What is the sum of 5238 and 2639?”. While they both test
the same learning outcome of being able to add, the time it may take to sum four-digit numbers will be more than the time it takes
to sum two-digit numbers. Therefore, the two generated questions aren't fair.
2. Consider the parameterized question “What is the result of $a $op $b?” where $a and $b are parameterized integer values and $op is a parameterized arithmetic operator. This question could lead to generating “What is the result of 42 + 74?” and “What is the result of 32 × 53?”. The generated questions test two different learning outcomes – one being addition, and the other being multiplication – and have different levels of difficulty.

Constraints need to be placed on parameters and macros so that they do not lend themselves to generating questions that have
different levels of difficulty or different learning outcomes.
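As an illustration, the sketch below (Python; not the framework's macro language) shows one way such a constraint could be enforced for the addition example: both operands are drawn from the same two-digit range and the operator is fixed, so every generated variant tests the same outcome at a comparable difficulty.

```python
import random

def make_sum_question(rng: random.Random) -> tuple[str, int]:
    """Generate a personalized addition question whose difficulty is held constant
    by constraining both operands to two-digit values."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return f"What is the sum of {a} and {b}?", a + b

rng = random.Random(137)                  # the seed could be derived from the script ID
stem, answer = make_sum_question(rng)
```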
All questions need to be peer-reviewed not only for correctness but also for fairness.

3.3.2. Distractors
Auto-generation of plausible distractors is another major pedagogical challenge. Distractors are generally based on exploiting the
common misconceptions of a typical learner, or based on forming incorrect options that are similar to the correct option. The latter


can lend itself to automation through extracting features of the correct option and using a subset of the features to form incorrect
options (Lai et al., 2016).

3.3.3. Coverage of learning outcomes


A personalized examination is programmatically generated. It may prove hard to translate the assessment of all learning outcomes into this model of programmatically generated questions. For example, macros and parameterization can only work in assessing certain concepts (e.g., arithmetic expressions). Where they do not work, the framework should support traditional non-macro, non-parameterized questions that are still personalized. The XYZ questions and true/false questions can help to fill this gap.

3.3.4. Quality assurance


All examination papers are peer-reviewed before they go to print. With personalized examinations, however, the peer-review process requires additional work since the questions are different in every examination script. This requires fairness to be added to the standard set of review rubrics. Besides, a reviewer cannot be expected to review every single script; instead, the reviewer will need to review the source of the generated scripts as well as a random sample of scripts generated from this source. The reviewer also needs to check that the registered correct answers and the generated answer options match.
For example, consider the parameterized question “What is the sum of $a and $b?” where $a and $b are parameterized integer values. A reviewer will need to review this parameterized question as well as a random sample of questions generated from it. The reviewer will also need to check that the range of values generated for $a and $b is not so broad as to induce different levels of question difficulty, leading to unfairness.

4. Software framework

The framework¹ requires the examination specification to be written as an HTML template with macros. The macros are functions that an instructor has to define. The framework has a macro processor that takes as its input an HTML template and a library consisting of the instructor-defined macros, and outputs HTML examination scripts in which the macros have been substituted by the results of executing them. See Fig. 3.
Use of HTML for the template allows the instructor to develop the bulk of the examination paper, such as the cover page, appendices, and possibly parts of question stems and/or answer options, in HTML and to supplement them with the code in the macros. It also allows a limited preview of the template, without any macro substitution, through any browser. HTML examination scripts permit digital delivery where applicable, or the scripts can be printed out for traditional paper-based examinations.
Fig. 4 illustrates a sample HTML template with a small set of sample questions. Macros are demarcated by the CSS (Meyer & Weyl, 2017) class cws_code_q. The macro processor identifies these macros, executes the macro code, and replaces the macros with the result of the execution. A macro beginning with $ has special significance within the macro processor. For example, $n is replaced with the nth digit of the script ID (expressed in base 5 using the letters A–E). The first few questions in the script instruct the students to choose the supplied answers, which form the script ID. A macro that does not start with $ is instructor-defined. These macros, when executed, return HTML fragments that replace the macro. The sample HTML template shows the use of three instructor-defined macros: GetElvishLanguages, GetThorinsCompany, and GetApplesAndOranges. The first two macros return answer options for the respective questions, while the third returns both the question stem and the answer options.
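The substitution step can be pictured with the following sketch (Python). The framework's real implementation is not shown in the paper, so the placeholder markup `<span class="cws_code_q">MacroName</span>`, the function names, and the macro bodies are assumptions made for illustration only; the paper states only that macros are demarcated by the CSS class cws_code_q.

```python
import re

# Instructor-defined macros: each returns an HTML fragment (bodies here are illustrative).
def GetElvishLanguages() -> str:
    return "<ol type='A'><li>Quenya</li><li>Sindarin</li></ol>"

MACRO_LIBRARY = {"GetElvishLanguages": GetElvishLanguages}

# Assumed placeholder form for a macro demarcated by the class cws_code_q.
PLACEHOLDER = re.compile(r'<span class="cws_code_q">(\$?\w+)</span>')

def expand_template(template_html: str, script_digits: str) -> str:
    """Produce one personalized script: $n is replaced by the nth digit of the
    script ID (a letter A-E); any other macro name is looked up in the
    instructor-defined library and replaced by the HTML fragment it returns."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name.startswith("$"):                  # built-in macro: nth script-ID digit
            return script_digits[int(name[1:]) - 1]
        return MACRO_LIBRARY[name]()              # instructor-defined macro
    return PLACEHOLDER.sub(substitute, template_html)

# Example: expand a tiny template for the student whose script ID digits are 'BACC'.
personalized_html = expand_template(
    '<p>Q1: choose option <span class="cws_code_q">$1</span>.</p>', "BACC")
```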
See Fig. 5 which illustrates a sample (partial) examination script generated by the macro processor.
The question on Elvish languages is a true/false question, where the question stem is defined in the template and the instructor-
defined macro returns answer options. In the macro, the instructor would supply a pool of true statements on the topic as well as a
pool of false statements. The framework has a built-in truth question type which would pick one true statement and four false
statements from the instructor-supplied pools. These statements are then shuffled to form the answer options. To pick a false
statement as the answer, the instructor would reverse the roles of true and false statements they supply.
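A minimal sketch of this built-in behaviour (Python; the real framework's interfaces are not given in the paper, so the function signature is an assumption):

```python
import random

def make_truth_question(true_pool: list[str], false_pool: list[str],
                        rng: random.Random) -> tuple[list[str], int]:
    """Truth question type as described: draw one true statement and four false
    statements from the instructor-supplied pools, shuffle them, and register
    the index of the correct option for scoring."""
    correct = rng.choice(true_pool)
    options = [correct] + rng.sample(false_pool, 4)
    rng.shuffle(options)
    return options, options.index(correct)
```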
The question on Thorin's company is an XYZ question where three true/false statements are involved. As in the truth question
type, the instructor would supply a pool of true statements as well as a pool of false statements on the topic. Recall that an XYZ
question can have eight possible answer options. Given that we only allow five answer options in the examination, some of the eight
possible answer options are combined. See Table 1 that illustrates two ways of combining them to make five answer options.
The framework has a built-in XYZ question type which first randomly chooses one of the possible option pools as illustrated in
Table 1. It then randomly chooses one of the answer options as the correct answer. Based on this choice, it would then pick an
appropriate mix of true and false statements from the instructor-supplied pools to assign to X, Y, and Z. This approach gives all answer
options equal probability.
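A sketch of how this selection could be implemented (Python; the option wording follows Table 1, while the data structures and function names are assumptions for illustration):

```python
import random

# The two answer-option pools of Table 1; each option maps to the (X, Y, Z)
# truth assignments it covers.
OPTION_POOLS = [
    {   # Type 1
        "X and Y only": [(True, True, False)],
        "X and Z only": [(True, False, True)],
        "Y and Z only": [(False, True, True)],
        "All of X, Y, and Z": [(True, True, True)],
        "None, or only one of X, Y, and Z": [
            (False, False, False), (True, False, False),
            (False, True, False), (False, False, True)],
    },
    {   # Type 2
        "X only": [(True, False, False)],
        "Y only": [(False, True, False)],
        "Z only": [(False, False, True)],
        "None of X, Y, and Z": [(False, False, False)],
        "All, or two of X, Y, and Z": [
            (True, True, True), (True, True, False),
            (True, False, True), (False, True, True)],
    },
]

def make_xyz_question(true_pool, false_pool, rng):
    """Return the X, Y, Z statements, the five answer options, and the correct option."""
    pool = rng.choice(OPTION_POOLS)              # pick the Type 1 or Type 2 option pool
    correct = rng.choice(list(pool))             # every answer option is equally likely
    truth = rng.choice(pool[correct])            # a truth assignment consistent with it
    trues = rng.sample(true_pool, truth.count(True))
    falses = rng.sample(false_pool, 3 - truth.count(True))
    statements = [trues.pop() if is_true else falses.pop() for is_true in truth]
    return statements, list(pool), correct
```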
Since the correct answer option is internally chosen in the XYZ question as well as the truth question, both question types register
the correct answer automatically. Recall that scoring requires that the correct answer be registered.
The logic for the apples-and-oranges question is completely defined by the instructor who created the macro. Unlike the previous two macros, this macro emits the entire question as well as the answer options. This illustrates the power of macros, and how such power can lead to highly complex questions. The number of apples and oranges and the Castar amounts differ from script to script, thus requiring every student to work out their own answer.

¹ The framework is available for download from www.dividni.com.


Fig. 3. HTML macro processor.

Fig. 4. Sample HTML template (partial).

The macro would look at common mistakes in the solution and produce plausible distractors (e.g., off-by-one answers, answers that swap apples and oranges, etc.). To ensure the same level of difficulty across the scripts, the number of apples and oranges should be kept within a reasonable range (e.g., 2–20). There is only one correct answer to this question, and therefore its design can use the truth question type: the macro could simply add the correct answer to the pool of true statements, and the distractors to the pool of false statements.
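The following is a minimal sketch of such a macro (Python; the Castar currency comes from the paper's example, while the prices, question wording, and distractor choices are assumptions for illustration):

```python
import random

def apples_and_oranges(rng: random.Random):
    """Personalized arithmetic question: total cost of some apples and oranges.
    Returns the question stem, the correct answer, and plausible distractors
    derived from common mistakes."""
    apples, oranges = rng.randint(2, 20), rng.randint(2, 20)          # constrained range
    apple_price, orange_price = rng.randint(1, 9), rng.randint(1, 9)  # Castars each
    correct = apples * apple_price + oranges * orange_price

    distractors = {
        apples * orange_price + oranges * apple_price,   # swapped apples and oranges
        correct + 1,                                     # off-by-one
        correct - 1,                                     # off-by-one
        (apples + oranges) * apple_price,                # priced everything as apples
    }
    distractors.discard(correct)      # never offer the correct answer as a distractor
    stem = (f"Apples cost {apple_price} Castars each and oranges cost "
            f"{orange_price} Castars each. What is the total cost of "
            f"{apples} apples and {oranges} oranges?")
    return stem, correct, sorted(distractors)
```

As the text notes, the correct answer can then be added to the truth question type's pool of true statements and the distractors to its pool of false statements, so the correct answer is registered automatically.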
While the framework supports programmatically creating any complex question, it also allows instructors to create true/false and
XYZ questions without any programming. True/false and XYZ questions are generic and are applicable across many disciplines. Being
able to write them without requiring any programming knowledge, therefore, enables wider use of the framework.
To form true/false and XYZ questions, the instructor would simply supply the true and false statement pools in an XML format.
Fig. 6 shows the XML specification of the question on Thorin's company. The XML content is converted into the macro code by the
framework. The attribute type specifies whether the question is an XYZ question or a true/false question, while the attribute id
provides a question id that is to be used in the HTML template.
Note that the XML specification includes the question stem. The framework supports having the question stem either in the HTML
template or in the XML specification – the suggested option is the latter since it allows an instructor to review a question solely by
viewing the corresponding XML file (e.g., in a browser).
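Fig. 6 is not reproduced here, but such a specification might look roughly like the sketch below; only the type and id attributes are named in the text, so the element names and the statements themselves are assumptions for illustration.

```xml
<question type="xyz" id="GetThorinsCompany">
  <stem>Consider the following statements about Thorin's company.</stem>
  <true>Thorin's company included thirteen dwarves.</true>
  <true>Bilbo Baggins joined the company as its burglar.</true>
  <false>Gandalf stayed with the company for the entire journey.</false>
  <false>The company set out from Rivendell.</false>
</question>
```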


Fig. 5. Examination script (partial) corresponding to the template of Fig. 4.

Table 1
Two possible answer option pools for XYZ questions.
Type 1                            | Type 2
X and Y only                      | X only
X and Z only                      | Y only
Y and Z only                      | Z only
All of X, Y, and Z                | None of X, Y, and Z
None, or only one of X, Y, and Z  | All, or two of X, Y, and Z

5. Evaluation

5.1. Research methodology

The research questions relate to the feasibility of constructing a generic framework to support personalized examinations and to the challenges posed by personalization. Loosely speaking, both questions are answered positively through the successful construction of the framework and its continued use in examinations and in-class tests. The research methodology we followed to evaluate and substantiate these claims is as follows.
We trialled the system in an assessed in-class test. We collected anonymous feedback from the students on their perception of how resilient the test was to cheating, and whether they thought the test was fair. We also had a question on their view of using personalized tests in other courses, as well as a question on their overall liking of such tests. The questions used the standard 5-point Likert scale. In


Fig. 6. Sample XML specification of the question on Thorin's company.

addition, the students were able to provide open-ended feedback. The survey was open to all students in the course.
We also evaluated the system with a small group of staff who compared personalized tests and standard 4-version tests in the light of cheating. We also asked staff how much time they were prepared to spend developing personalized tests.
We compared the exam performance of students in two offerings of the same course – the first exam used standard 4-version multiple-choice questions, while the second used personalized questions.

5.2. Initial trial

The system was first trialled in a third-year computer science class which had just over 400 students. This was a for-credit supervised test conducted in class.
Ten staff members reviewed the test script, and two of them reviewed the source macros. Reviewing the source macros is
important because they effectively generate the scripts. In particular, the true and false statement pools of the truth questions and
XYZ questions need peer review.
In addition, two teaching assistants checked a random sample of scripts to verify that the auto-generated answers were indeed
correct answers.
On completion of the test, an anonymous online survey was conducted of the whole class. The response rate was around 30%.²
The summary of the survey results is listed in Table 2. The results show that a large proportion of the students who responded to
the survey are in favour of personalized tests, and over 80% agree that personalization reduces the level of cheating. There was some
skepticism over fairness and this is reflected in the responses to our second question.

5.2.1. Post-examination issues


One of the challenges mentioned earlier is that the framework should be flexible enough to help with fixing errors in the generated scripts post-examination. This section describes some of the post-test issues we had.
Even though we had checked the answers in a random sample of test scripts, a bug in the answer generation of a group of three questions went undetected. This had the following effect: 138 students were potentially marked incorrectly in one question, 45 students in two questions, and 9 students in all three questions. A temporary patch to the macros of the three affected questions allowed us to generate the correct answers and re-score the test.
One student copied the script ID incorrectly: BEBC was entered as DEBC. Another student did not enter the script ID at all. We had to manually adjust or enter the IDs in the Scantron results and re-score these students' sheets. Since the students were required to retain the test script until the marking was complete, fixing the IDs was easy.
Two students had entered the same version number. This could mean that they colluded, that one of them copied the answers from the other, or that one of them entered the version number incorrectly. In this instance, it turned out to be the latter. When two or more answer scripts share the same version number, examining the corresponding students' question scripts gives insight into possible cheating.

² The response rate was low, but this is the norm in most of our undergraduate courses.


Table 2
Student evaluation results of the personalized mid-semester test (2017, semester 2). SD: strongly disagree; D: disagree; N: neutral; A: agree; SA:
strongly agree.
Question                                                                              | SD (%) | D (%) | N (%) | A (%) | SA (%)
Personalized tests help to reduce the level of cheating                              | 3      | 3     | 4     | 32    | 58
Personalized tests ensure everyone needs to work equally hard to achieve good scores | 10     | 12    | 17    | 30    | 31
It would be a good idea to have personalized tests in other courses                  | 9      | 9     | 13    | 39    | 30
Overall, I like the idea of personalized tests                                       | 8      | 11    | 14    | 35    | 32

5.3. Staff survey

We also evaluated the system with a group of staff (which also included staff from other departments such as Mathematics and
Medical Sciences).
The group attempted a short standard 4-version test under test conditions, but with a view to cheating. The group observed that standard test seating arrangements allow one to see the answer options chosen by those around them if the options are marked on the test script. Even if the options could not be read, the chosen options could be inferred from some characteristic of the chosen options (e.g., the length of an option or any other prominent feature). The group noted that it was also possible to collude using the strategy illustrated earlier in Fig. 1, and that such collusion was much easier than trying to glean answers from unwitting neighbours.
The group then attempted to cheat in a short, personalized test and found that it was not possible unless they were allowed to
discuss the concepts and answer options with others.
One of the major downsides of personalized examinations is that an instructor needs to spend more time developing questions. The extra time is attributed to (1) developing pools of true/false statements, and (2) writing macros. We conducted a survey among a small group of staff to see how much extra time they might be willing to spend developing personalized examinations. All but one of the staff members were prepared to spend more time (2–3 times more), and all of them strongly agreed that personalization helps to reduce the level of cheating.

5.4. Further feedback

Considering the experience and feedback from the initial trial, the system was rolled out for use in other tests and examinations. Courses that currently use the system include digital security, computer networks, software development, and web applications. The class sizes range from about 300 to just over 400 students. While the tests are run in class, supervised by instructors and teaching assistants, the examinations are run centrally by the University. The administrative head of examinations had this to say about the personalized examinations: “Only positive feedback. No student or supervisor issues. It seems to have run very smoothly, so happy to continue with this.”
Free-format feedback from students, collected over a number of courses, was generally positive. Most of the comments related to
cheating, and commended the ability to combat cheating. Some of these comments are:

1. “Less fear of cheating comrades next to me.”


2. “No matter what, looking at someone else's working is meaningless to you.”
3. “I like the personalized tests, they mean that someone can't just look over and copy off you if you're doing the same version of the
test, there is no way of knowing.”
4. “The bro next to me can't really gain anything from cheating”

Personalized examinations not only help to mitigate cheating, but also encourage students to focus on concepts rather than just answers. Many students commented on how the system contributed to positive learning:

1. “When debriefing with my mates after the exam, we focus on the content not the raw answer.”
2. “Had to make sure to understand the concepts not just memorize answers”
3. “It encouraged knowledge of the content rather than knowledge of the question”
4. “A few times people posted their version of the question on Piazza, which meant everyone else got the benefit of having an extra
version of the question to practice”
5. “Makes you understand concepts better rather than rote learning stuff”
6. “After the test people were posting their questions on Piazza asking for help/explanations etc., this had the positive side effect that
everyone reading got a new version of the question to practice, which is more fun than just going over one version of the question
repeatedly.”

There were also genuine concerns about the fairness of the system:


Table 3
Grade distributions in two courses.
Grade | Course 1, 2017 | Course 1, 2018 | Course 2, 2017 | Course 2, 2018
A+    | 20  | 16  | 30  | 15
A     | 21  | 22  | 19  | 12
A-    | 34  | 22  | 23  | 15
B+    | 39  | 25  | 21  | 20
B     | 72  | 42  | 29  | 27
B-    | 65  | 43  | 29  | 26
C+    | 57  | 39  | 43  | 36
C     | 22  | 29  | 33  | 41
C-    | 6   | 16  | 17  | 32
F     | 81  | 56  | 78  | 72
Total | 417 | 310 | 322 | 296

1. “Some people may get hard versions which take up more time to do. However, if you knew the topic really well, it wouldn't affect
too much.”
2. “That there is an RNG factor – if I sit in one seat I could get better marks than if I sat in another seat because of the choices
available”
3. “Not a level playing field if some of the questions are randomly assigned to be harder than others.”
4. “Difficulty of questions can vary from version to version.”

While personalization has the potential to introduce unfairness, its reduction of cheating increases fairness. One student com-
mented “No one would be able to cheat, so in that sense [personalization] made it fair”.

5.5. Impact on student performance

Two of our courses used personalized examinations in 2018. Table 3 shows the grade distributions in the two courses across two
years: 2017 used standard 4-version multiple-choice examinations while 2018 used personalized multiple-choice examinations.
We performed Pearson's χ² tests on the two sets of ordinal data to determine whether there was a significant difference between the two years in the two courses. The proportional distributions of grades are not exactly the same in the two years, but the χ² tests show no evidence of statistically significant differences.
Pearson's χ² tests yield a p-value of 0.05659 for Course 1 and a p-value of 0.1307 for Course 2. Both p-values are above 0.05, so the differences are not statistically significant, notwithstanding some weak evidence of an association between year and grade composition. Personalization is therefore not likely to have caused any statistically significant difference.
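For reference, a chi-squared test of this kind can be run on the Table 3 counts as in the sketch below (Python with SciPy). This illustrates the method only; it is not the authors' analysis script, and choices such as whether to pool low-count grade cells could shift the exact p-values.

```python
from scipy.stats import chi2_contingency

# Grade counts from Table 3, ordered A+, A, A-, B+, B, B-, C+, C, C-, F.
course1 = [[20, 21, 34, 39, 72, 65, 57, 22, 6, 81],   # 2017
           [16, 22, 22, 25, 42, 43, 39, 29, 16, 56]]  # 2018
course2 = [[30, 19, 23, 21, 29, 29, 43, 33, 17, 78],  # 2017
           [15, 12, 15, 20, 27, 26, 36, 41, 32, 72]]  # 2018

for name, table in [("Course 1", course1), ("Course 2", course2)]:
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{name}: chi-squared = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```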

5.6. Post-examination issues

Post-examination issues similar to those observed in the trial continued to occur.
In spite of the extensive reviews, some of the tests and one examination had one or two questions with incorrectly marked answers. The macros needed to be patched post-examination to correct these issues. The framework provides convenient mechanisms to interrogate the answer options generated for each script and to take corrective measures in case of errors – after patching the macros, the expected answer options are re-created.
The framework picked up five pairs of students using the same version number. While two of the pairs turned out to involve a student who had incorrectly filled in their version number, the other three pairs showed strong signs of cheating – they had far more answer similarities than would be statistically expected. In each of these three pairs, one of the students had used the other student's version number and most of their answers. One of these pairs has been investigated by the University disciplinary committee, and it was found that one student had indeed copied from the other, unwitting, student – the former was unable to produce their examination script with their unique version number, while the latter was able to. The student who copied has subsequently been disciplined. The other two pairs are under investigation.

6. Discussion

The successful development of the framework for personalized examinations, and its use within several large classes, positively answered our first research question. These classes were in a number of different areas, such as software development, computer networks, and digital security. The framework is therefore seen to be generic enough to span multiple areas. In addition, the truth and XYZ question types allow the framework to be used in many domains and by instructors less comfortable with programming.
To answer our second research question, we discussed the challenges posed by personalization and how we addressed them. Some of these challenges were addressed by the framework, while the others were addressed by the processes we followed (such as


Table 4
Challenges and solutions – summary.
Challenge Addressed by …

Framework Process

Examination Delivery ✓
Student and Script Identification ✓
Scoring and Script Identification ✓
Supporting Multiple Types of Selected-response Questions ✓
Duplicate Answer Options ✓
Accounting for Errors ✓ ✓
Fairness ✓ ✓
Distractors ✓ ✓
Coverage of Learning Outcomes ✓
Quality Assurance ✓

peer-review). Some challenges required the help of the framework as well as adherence to processes. Table 4 summarizes the challenges and how they were addressed.
Personalized examinations allow an instructor to re-use a number of questions; this is because personalization ensures that the
students will have different values to work with – students cannot simply memorize answers from a previous examination. In fact, in
two of our classes we ran repeat tests – tests that had the same questions as the first test but with different data sets. This allowed the
students to study the areas where they had gaps, and get better scores the second time. One of the students said of the repeat test: “I
was thankful for the repeat test as it allowed me to improve greatly and was able to understand the concepts better”.
In addition, personalized examinations enable the administration to use less space for conducting the examinations: there is no need to have large gaps between students. One of the students wrote this in the free-format feedback: “You can pack way more sardines into the can this way”. Note that the examinations are still supervised and students are not allowed to talk to each other or share information.
Personalization has the potential to introduce unfairness in the assessment. It is therefore paramount that fairness is taken into
account when developing questions, and the questions are carefully reviewed for fairness before they are used.
Personalization helps to reduce cheating by collusion as well as copying from unwitting peers, but it does not help to reduce other means of cheating such as contract cheating or the use of modern communication devices inside the examination room. These need to be handled in the traditional way. For example, effective identification checks could mitigate ghost-writing in examinations.
Used in conjunction with typical test conditions (such as ID checks, ban on electronic devices, restrictions on personal items, etc.),
personalized examinations can be an effective mechanism to reduce cheating incidents.
Personalized examinations not only help to mitigate cheating, but also encourage students to focus on concepts rather than
answers. Student feedback reported in section 5 attests to this.

7. Summary and conclusions

Cheating in examinations is an ongoing issue, especially when the stakes are high. This paper proposed an approach to reduce the level of cheating in the multiple-choice examinations that many large classes use. The approach is based on personalization, which creates as many versions of the examination script as there are students. Consequently, blindly copying the answers from other students will not help a student score better than just guessing. A software framework that supports personalized examinations was developed, and its design and use are discussed. We use the approach in large classes with sizes ranging from about 300 to 400 students. Our experience suggests that not only does personalization counter cheating, but it also encourages students to focus more on concepts than on mere answers.

References

Abaligeti, G., & Kehl, D. (2018). Personalized exams in probability and statistics. Proceedings of challenges and innovations in statistics education multiplier conference (pp.
4). ISBN: 978-963-306-575-4.
Broeckelman-Post, M. (2008). Faculty and student classroom influences on academic dishonesty. IEEE Transactions on Education, 51(2), 206–211. https://doi.org/10.
1109/TE.2007.910428.
Clarke, R., & Lancaster, T. (2006). Eliminating the successor to plagiarism? Identifying the usage of contract cheating sites. Proceedings of the 2nd international plagiarism
conference.
D'Souza, K. A., & Siegfeldt, D. V. (2017). A conceptual framework for detecting cheating in online and take-home exams. Decision Sciences Journal of Innovative
Education, 15(4), 370–391. https://doi.org/10.1111/dsji.12140.
Ercole, A., Whittlestone, K. D., Melvin, D. G., & Rashbass, J. (2002). Collusion detection in multiple choice examinations. Medical Education, 36(2), 166–172. https://
doi.org/10.1046/j.1365-2923.2002.01068.x.
Gierl, M. J., Bulut, O., Guo, Q., & Zhang, X. (2017). Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review. Review
of Educational Research, 87(6), 1082–1116. https://doi.org/10.3102/0034654317726529.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in
Education, 15(3), 309–333. https://doi.org/10.1207/S15324818AME1503.5.


ISO/IEC 7064:2003(E). (2003). Information technology – Security techniques – Check character systems. Geneva, Switzerland: International Organization for Standardization.
Kashy, E., et al. (1993). CAPA – an integrated computer-assisted personalized assignment system. American Journal of Physics, 61(12), 1124–1130. https://doi.org/10.
1119/1.17307.
Kumar, A. N. (2013). A study of the influence of code-tracing problems on code-writing skills. Proceedings of the 18th ACM conference on innovation and technology in
computer science education, ITiCSE ’13 (pp. 183–188). ACM. https://doi.org/10.1145/2462476.2462507.
Lai, H., Gierl, M. J., Touchie, C., Pugh, D., Boulais, A.-P., & Champlain, A. D. (2016). Using automatic item generation to improve the quality of MCQ distractors.
Teaching and Learning in Medicine, 28(2), 166–173. https://doi.org/10.1080/10401334.2016.1146608.
Manoharan, S. (2017). Personalized assessment as a means to mitigate plagiarism. IEEE Transactions on Education, 60(2), 112–119. https://doi.org/10.1109/TE.2016.
2604210.
Marx, D. B., & Longer, D. E. (1986). Cheating on multiple choice exams is difficult to assess quantitatively. North American Colleges and Teachers of Agriculture Journal,
30(1), 23–26.
Meyer, E., & Weyl, E. (2017). CSS: The definitive guide (4th ed.). O'Reilly Media.
Richmond, P., & Roehner, B. M. (2015). The detection of cheating in multiple choice examinations. Physica A: Statistical Mechanics and Its Applications, 436(Supplement
C), 418–429. https://doi.org/10.1016/j.physa.2015.05.040.
Shon, P. C. H. (2006). How college students cheat on in-class examinations: Creativity, strain, and techniques of innovation. Plagiary: Cross-Disciplinary Studies in Plagiarism, Fabrication, and Falsification, 1(1), 130–148.
Smaill, C. (2005). The implementation and evaluation of OASIS: A web-based learning and assessment tool for large classes. IEEE Transactions on Education, 48(4),
658–663. https://doi.org/10.1109/TE.2005.852590.
Steven, T. M. H., & Downing, M. (Eds.). (2006). Handbook of test development. Routledge. https://doi.org/10.4324/9780203874776.
The Quality Assurance Agency for Higher Education (2017). Contracting to cheat in higher education – how to address contract cheating, the use of third-party services and
essay mills. Gloucester, United Kingdom.
Walker, M., & Townley, C. (2012). Contract cheating: A new challenge for academic honesty? Journal of Academic Ethics, 10(1), 27–44. https://doi.org/10.1007/
s10805-012-9150-y.
Weld, D. S., et al. (2012). Personalized online education – a crowdsourcing challenge. Proceedings of the 26th AAAI conference on artificial intelligence.
Wesolowsky, G. O. (2000). Detecting excessive similarity in answers on multiple choice exams. Journal of Applied Statistics, 27(7), 909–921. https://doi.org/10.1080/
02664760050120588.
Young, A., & Shawl, S. J. (2013). Multiple choice testing for introductory astronomy: Design theory using Bloom's taxonomy. Astronomy Education Review, 12(1), 1–27.
Zare, S. (2011). Personalization in mobile learning for people with special needs. In C Stephanidis (Ed.). Universal access in human-computer interaction. Applications and
services, vol. 6768 of lecture notes in computer science (pp. 662–669). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-21657-2_71.
Zobel, J. (2004). “Uni cheats racket”: A case study in plagiarism investigation. Proceedings of the 6th Australasian conference on computing education: Vol. 30, (pp. 357–
365).

