
Original article

doi: 10.1111/j.1365-2729.2010.00368.x

Adaptive item-based learning environments based on the item response theory: possibilities and challenges
Journal of Computer Assisted Learning (2010), 26, 549–562

K. Wauters, P. Desmet & W. Van den Noortgate


iTEC, Interdisciplinary Research on Technology, Education and Communication, Katholieke Universiteit Leuven, 8500 Kortrijk, Belgium; Faculty of Psychology and Educational Sciences, Katholieke Universiteit Leuven, 3000 Leuven, Belgium; Franitalco, Research on French, Italian and Comparative Linguistics, Katholieke Universiteit Leuven, 3000 Leuven, Belgium

Accepted: 14 May 2010

Correspondence: Kelly Wauters, Katholieke Universiteit Leuven Campus Kortrijk, Etienne Sabbelaan 53, 8500 Kortrijk, Belgium. Email: kelly.wauters@kuleuven-kortrijk.be

Abstract

The popularity of intelligent tutoring systems (ITSs) is increasing rapidly. In order to make learning environments more efficient, researchers have been exploring the possibility of an automatic adaptation of the learning environment to the learner or the context. One of the possible adaptation techniques is adaptive item sequencing by matching the difficulty of the items to the learner's knowledge level. This is already accomplished to a certain extent in adaptive testing environments, where the test is tailored to the person's ability level by means of the item response theory (IRT). Even though IRT has been a prevalent computerized adaptive test (CAT) approach for decades and applying IRT in item-based ITSs could lead to similar advantages as in CAT (e.g. higher motivation and more efficient learning), research on the application of IRT in such learning environments is highly restricted or absent. The purpose of this paper was to explore the feasibility of applying IRT in adaptive item-based ITSs. Therefore, we discussed the two main challenges associated with IRT application in such learning environments: the challenge of the data set and the challenge of the algorithm. We concluded that applying IRT seems to be a viable solution for adaptive item selection in item-based ITSs provided that some modifications are implemented. Further research should shed more light on the adequacy of the proposed solutions.

Keywords: adaptive item selection, e-learning, intelligent tutoring systems, IRT, item-based learning

Introduction

Learning environments increasingly include an electronic component, offering some additional possibilities, such as making education more accessible to a broad public by creating the possibility for learners to study at their own pace, anytime and anywhere. However, one shortcoming of most learning environments is that they are static, in the sense that they provide for each learner the same information in the same structure using the same interface. Research in the field of educational sciences suggests that learners differ from each other with respect to their learning characteristics, preferences, goals and knowledge level, and that these factors as well as context characteristics (e.g. location, time, etc.) have an influence on the learning effectiveness (Kelly & Tangney 2006; Verdú et al. 2008). Therefore, adaptive learning systems have been developed, aimed at optimizing learning conditions (Bloom 1984; Wasson 1993; Lee 2001; Verdú et al. 2008). In the context of electronic learning environments, the term adaptivity is used to refer to the adjustment of
one or more characteristics of the learning environment as a function of the learner's needs and preferences and/or the context. More specifically, adaptivity can relate to three dimensions: the form, the source and the medium.

The form of adaptivity refers to the adaptation techniques that are implemented (Brusilovsky 1999; Brusilovsky & Peylo 2003; Paramythis & Loidl-Reisinger 2004). Forms can be summarized in three categories: adaptive form representation, adaptive content representation and adaptive curriculum sequencing. Adaptive form representation refers to the way the content is presented to the learner and includes, for instance, whether pictures and videos are added to the text and whether links are visible or highlighted. Adaptive content representation concentrates on providing the learner with intelligent help on each step in the problem-solving process, based on discovered knowledge gaps of learners. Adaptive curriculum sequencing is intended to select the optimal question at any moment in order to learn certain knowledge in an efficient and effective way.

The source of adaptivity refers to the features involved in the adaptivity process and can be classified into three main categories: course/item features, such as the difficulty level and topic of the course; person features, composed of the learner's knowledge level, motivation, cognitive load, interests and preferences; context features, which contain the time when, place from which and device on which the learner works in the learning environment; or a combination of these, for instance, the sequencing of the curriculum might be based on adapting the item's difficulty level to the learner's knowledge level.

Regarding the medium of adaptivity, a main distinction can be made between two types of adaptive learning environments: adaptive hypermedia (AH) and intelligent tutoring systems (ITSs) (Brusilovsky 1999). While AH systems are composed of a large amount of learning material presented by hypertext, ITSs often provide only a limited amount of learning material. Instead, ITSs intend to support a learner in the process of problem solving. ITSs can be further divided into task-based and item-based ITSs. Task-based ITSs, such as ASSISTance and assessments (ASSISTments) (Razzaq et al. 2005), are composed of substantial tasks or problems that are often tackled by means of scaffolding problems, which break the main problem down into smaller, learnable chunks. The learner tries to come to a reasonable solution by applying the knowledge and skills that are needed to address these problems.

Item-based ITSs, such as Franel (Desmet 2006) and SIETTE, the Spanish acronym of an Intelligent Evaluation System using Tests for Tele-Education (Conejo et al. 2004; Guzmán & Conejo 2005), are composed of simple questions that can be combined with hints and feedback. In the remainder of this paper, we will focus on adaptive curriculum sequencing in item-based ITSs by matching the item difficulty level to the learner's knowledge level.

Recently, interest in adaptive item sequencing in learning environments has grown (e.g. Pérez-Marín et al. 2006; Leung & Li 2007). Generally, it is found that excessively difficult course materials can frustrate learners, while excessively easy course materials can cause learners to lack any sense of challenge. Learners prefer learning environments where the item selection procedure is adapted to their knowledge level. Moreover, adaptive item sequencing can lead to more efficient and effective learning. Even though several methods have been adopted to match the item difficulty level to the learner's knowledge level (e.g. Bayesian networks; Collins et al. 1996; Millán et al. 2000; Conati et al. 2002), it seems obvious to make use of the item response theory (IRT; Van der Linden & Hambleton 1997), as it is frequently used in computerized adaptive tests (CATs) for adapting the item difficulty level to the person's knowledge level (Lord 1970; Wainer 2000). IRT models are measurement models that specify the probability of a discrete outcome, such as the correctness of a response to an item, in terms of person and item parameters. In CAT, IRT is often used to generate a calibrated item bank, to estimate and update an examinee's ability estimate and to decide on the next item to render. CAT can potentially decrease testing time and the number of testing items and increase measurement precision and motivation (Van der Linden & Glas 2000; Wainer 2000).

Despite the fact that IRT is a well-established approach in testing environments, for instance within CAT, to date the implementation of IRT within learning environments is limited. This paper will elaborate on this issue. More precisely, the aim of this paper is to identify and discuss the possibilities and challenges of extrapolating the application of IRT for adaptive item sequencing from testing applications to item-based ITSs, partly based on our own experiences with this type of item-based ITS. Furthermore, based on techniques used for other types of learning and testing environments, we suggest possible tracks for dealing with those challenges.
A third purpose of the paper is to guide future empirical research in this field, including the empirical evaluation of the ideas proposed in this paper.

In what follows, we will first briefly address adaptive item sequencing in testing environments, since it can help to understand adaptive item sequencing in learning environments. Next, we will discuss the differences between applying IRT for adaptive item sequencing in learning environments and in testing environments. In particular, we will address the possibilities and challenges of applying IRT for this purpose in item-based ITSs. Challenges regarding the data set and the algorithms will be highlighted, and some solutions for dealing with those challenges will be suggested.
Adaptive item sequencing in testing environments

As indicated earlier, the process of tailoring a test to a person can be based on IRT. IRT models are measurement models describing responses based on person and item characteristics. The Rasch model (Rasch 1960) is the simplest IRT model and expresses the probability of observing a particular response to an item as a function of the item difficulty level and the person's ability level.

A prerequisite for adaptive item sequencing, and hence for CAT, is to have items with a known difficulty level. Therefore, an initial development of an item bank with items of which the item difficulty level is known is needed before an adaptive test can be administered. This item bank should be large enough to include at any time an item with a difficulty level within the optimal range that was not yet presented to the person. To calibrate the items of the item bank, i.e. to estimate the item parameters (e.g. difficulty level), typically a non-adaptive test is taken by a large sample of persons. Guidelines for the number of persons needed to obtain reliable item parameter estimates vary between 200 and 1000 persons (e.g. Huang 1996; Wainer & Mislevy 2000).

Following item calibration, several steps are taken within a CAT, as illustrated in Fig 1. When no information is provided about the ability level of the person, the first items that are administered are often items with an average item difficulty level. After a few items are administered, the person's ability parameter is estimated by means of his or her answers on these items. Next, the item selection algorithm chooses the item that provides the most information about the current ability estimate as the next item.
Fig 1 Computerized adaptive testing procedure.

Within the framework of the Rasch model, based on which item difficulties and person abilities are directly comparable, this is an item whose difficulty level is equal to the ability level of the person. Next, the person's ability parameter is re-estimated based on all previous items. Based on the updated estimate, an additional item is selected. These steps are repeated until a certain stopping criterion is met, for instance, when the ability estimate hardly changes or when a specific number of items is reached.
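To make the cycle concrete, the following minimal Python sketch simulates a Rasch-based CAT session: the most informative remaining item is administered, the response is observed and the ability estimate is updated. It is only an illustration of the generic procedure described above; the simulated learner, the grid-based likelihood maximization and the fixed test length are assumptions made for this example, not the implementation of any particular system.

```python
import math
import random

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def fisher_information(ability, difficulty):
    """Item information at the given ability level (Rasch model: p * (1 - p))."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)

def estimate_ability(responses, difficulties):
    """Crude maximum-likelihood estimate over a grid of candidate ability values."""
    grid = [g / 10.0 for g in range(-40, 41)]          # -4.0 .. 4.0 logits
    def log_likelihood(theta):
        ll = 0.0
        for correct, b in zip(responses, difficulties):
            p = rasch_probability(theta, b)
            ll += math.log(p if correct else 1.0 - p)
        return ll
    return max(grid, key=log_likelihood)

def run_cat(item_bank, administer, test_length=20):
    """Select, administer and re-estimate until the fixed test length is reached."""
    ability = 0.0                                      # start from an average ability level
    used, responses, used_difficulties = set(), [], []
    for _ in range(test_length):
        candidates = [i for i in range(len(item_bank)) if i not in used]
        # most informative remaining item at the current ability estimate
        item = max(candidates, key=lambda i: fisher_information(ability, item_bank[i]))
        correct = administer(item)                     # observe the person's response
        used.add(item)
        responses.append(correct)
        used_difficulties.append(item_bank[item])
        ability = estimate_ability(responses, used_difficulties)
    return ability

if __name__ == "__main__":
    random.seed(1)
    bank = [random.uniform(-3.0, 3.0) for _ in range(200)]   # calibrated difficulties (assumed known)
    true_ability = 1.2                                       # simulated examinee
    simulate = lambda i: random.random() < rasch_probability(true_ability, bank[i])
    print("final ability estimate:", run_cat(bank, simulate))
```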
Adaptive item sequencing in learning environments

The research to date has tended to focus on applying IRT for adaptive item sequencing in testing environments rather than in learning environments. Furthermore, the research on IRT in learning environments has mainly concentrated on assessment, both formative and summative (Millán et al. 2003; Guzmán & Conejo 2005; Chen et al. 2006, 2007; Leung & Li 2007; Baylari & Montazer 2009; Chen & Chen 2009). However, some learning environments already make use of IRT to adaptively sequence the learning material (Chen et al. 2005, 2006; Chen & Chung 2008; Chen & Duh 2008; Chen & Hsu 2008). These researchers have applied original and modified IRT with success in AHs.

Nevertheless, our experience with item-based ITSs such as Franel (Desmet 2006) shows that the application of IRT in item-based ITSs has distinct possibilities and challenges.

Franel is a freely accessible item-based learning environment for French-speaking persons to learn Dutch and for Dutch-speaking persons to learn French. To optimize language learning, Franel is composed of several didactical item types, for instance multiple choice, fill in the blank and translation, and several media features, such as video and audio. Learners are also guided towards the correct response by means of hints and elaborated feedback. This elaborated feedback, such as error-specific feedback, can depend on the answer the learner has given and can therefore identify and correct knowledge gaps. Furthermore, learners can choose to follow a guided path or to follow their own learning path. Analysis has shown that learners tend to follow their own learning path, in which they can freely select the items they want to complete. The item pool contains thousands of items. The items can be chosen from a navigation menu. The appearance of the navigation menu is not adapted to the learner, which means that every learner is presented the same navigation menu with the items in the same order. The learner can choose whether he or she wants the items to be categorized by chapter/topic (e.g. hobby, work, etc.) or by domain (e.g. grammar, vocabulary, etc.). Besides the topic and domain, there is no further logical order in which the items are listed within the navigation menu. Once a learner has selected an item, the learner is free to answer the item, ask for a hint or the solution, or go to the next item without making any attempt.

That the implementation of IRT in item-based ITSs has received substantially less research attention could be due to the fact that extrapolating the ideas of CAT and IRT to such learning environments is not straightforward. Our experience with the tracking and logging data of item-based ITSs such as Franel (Desmet 2006) taught us that two major challenges arise when we want to apply IRT for adaptive item sequencing in such learning environments. A first challenge is building a data set that yields sufficient and accurate information about the person and item characteristics. A second challenge is the question of the model to be applied for adaptive item selection in item-based ITSs. In what follows, each of those challenges will be discussed.

The challenge of the data set

The difference in the data-gathering procedure between learning and testing environments has implications for IRT application in learning environments. In learning environments, data are generally collected in a less structured way, resulting in a missing data problem. Moreover, analysis and interpretation can be hampered by the skipping of items.

Missing values

In item-based ITSs, learners are free to choose the exercises they want to complete. This, combined with the possibly vast amount of exercises provided within a learning environment, leads to the finding that many exercises are attempted by only a few learners. Even though IRT can deal with structurally incomplete data sets, the structure and huge amount of missing values found in the tracking and logging data of learning environments can easily lead to non-converging estimations of the IRT model parameters. Hence, the challenge of the missing values is twofold: we need to obtain the structure of data needed for item calibration and we need to obtain the amount of data needed for reliable item parameter estimation.

In practical applications of IRT, the number of items that have to be calibrated is often so large that it is practically not feasible to administer the entire set of items to one sample of learners. IRT models are able to analyse incomplete data under certain circumstances (Eggen 1993). More specifically, even though learners have not answered all items, a common measurement scale can be constructed on which all items can be located. There is, however, a problem regarding the structure of the data if there is no overlap between the items solved by one or several persons and those solved by other persons. Without making additional assumptions, this lack of overlap does not allow assessing the difficulty of the items and/or the ability of the persons, since in IRT this is performed relative to the other items or persons. Unfortunately, this lack of overlap might be encountered in learning environments, especially if learners are completely free in selecting items from a large item pool, as is the case in Franel.

A method that can be applied to obtain the structure needed for item calibration is to administer all learners some common items. This would result in a calibration design such as the one presented in Fig 2.

Fig 2 A common items design. The black cells represent the common items. The lighter cells represent the non-common items.

Fig 3 A block-interlaced anchoring design. The cells range from dark grey to pale grey. The lighter the cells, the smaller the number of items belonging to that item set that are made by each individual learner within the learner sample, and the smaller the overlap across the different learners in items made.

This incomplete calibration design with a common items anchor has the advantage that neither the equivalence of learners nor the equivalence of items needs to be assumed. Additionally, an advantage for the course creator is that he or she has some control over the item administration, such that non-overlapping parts in the data set can be avoided. A drawback of the approach, however, is that at the end we have a lot of information about the common items but much less about the other items. Moreover, while item development is a very expensive and time-intensive enterprise, not all items will be used to the same degree, and possibly some items will hardly be used.

This problem of unequal use of items might also occur in learning environments in which learners can choose items freely out of a fixed navigation menu but select items according to a certain pattern (for instance, learners try to solve the items in the order they are listed within the navigation menu). A solution to this problem is to vary the order of the items as listed in the learning environment. For instance, if all learners select some items from the first topic, fewer items from the second topic, even fewer items from the third topic, etc., systematically rotating the order in which the topics are presented in the navigation menu (the last topic comes first, the first second, the second third and so on) could lead to a calibration design similar to the one presented in Fig 3.
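A minimal sketch of such a rotation, assuming learners are simply assigned to samples in a round-robin fashion, could look as follows (the function and variable names are only for this example):

```python
def rotated_topic_order(topics, learner_sample):
    """Present the topics in an order that starts at a different topic for each learner sample."""
    k = learner_sample % len(topics)
    return topics[k:] + topics[:k]

# Example: five topic blocks, learner sample 2 starts at the third topic.
print(rotated_topic_order(["topic 1", "topic 2", "topic 3", "topic 4", "topic 5"], 2))
# ['topic 3', 'topic 4', 'topic 5', 'topic 1', 'topic 2']
```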
Applying this incomplete calibration design with a block-interlaced anchor within learning environments has two main advantages: an anchor-item effect is achieved while the number of administrations across items is kept more equal, and both the course creator and the learner have some control over the item administration. A drawback is that it is still possible, though less likely, to end up with a data set with non-overlapping parts.

The amount of data refers to the data-centric character of IRT models, which implies that enough data are required to estimate the item parameters. Some researchers have used prior calibration in learning environments for item difficulty estimation (Tai et al. 2001; Guzmán & Conejo 2005; Chen & Duh 2008). However, this is a time-consuming and costly procedure and therefore less appropriate for learning environments with lots of items. Alternatively, online calibration can be employed, in which both the learner's knowledge level and the difficulty level of the new items are estimated during the learning sessions. Makransky and Glas (2010) explored a continuous updating strategy that yields a low mean average error for the ability estimate.
In this continuous updating strategy, items are initially randomly selected. After each exposure, the items are (re)calibrated. Only when an item has been administered a certain number of times does it become eligible for adaptive item selection. A second issue is that more persons are required to achieve the same level of precision with larger item banks. Besides, when the number of available items becomes large, which is the case in our item-based ITS, it will take longer for randomly selected items to be administered a certain number of times in order to become eligible for adaptive item selection.

A method to influence the administration or exposure rate of items, such that the amount of data needed for reliable item calibration is obtained faster than with random item selection, is to apply an item exposure control algorithm. Item exposure control methods are often applied in CAT (Georgiadou et al. 2007). The underlying philosophy of CAT, namely selecting items that provide the most information about the current ability estimate, leads to overexposure of the most informative items, while other items, which provide less information, are rarely selected. Several item exposure control strategies have been proposed to prevent overexposure of some items and to increase the use rate of rarely or never selected items (Georgiadou et al. 2007). In CAT, the objective of such item exposure control strategies is therefore to equalize the frequency with which all items are administered. Yet, the objective of item exposure control strategies in item-based ITSs would rather be to increase the administration frequency of items that are close to being reliably calibrated. A major advantage of item exposure control strategies in item-based ITSs is the decrease in the time needed for items to become eligible for adaptive item selection.

An item exposure control method that imposes item-ineligibility constraints on the assembly of shadow tests seems to be a reasonable solution that can be applied within item-based ITSs (Van der Linden & Veldkamp 2004, 2007). So far, this method has only been implemented in testing environments. In this item exposure control strategy, the selection of each new item is preceded by an online assembly of a shadow test. This implies that items are not selected directly from the item pool but from a shadow test. The shadow test is a full-sized test that is optimal at the person's current ability estimate and contains all items, administered or not, that meet all the constraints of the adaptive test. The optimal item at the ability estimate from the free items in the shadow test is subsequently selected for administration. After the item is administered, the items in the shadow test that are not yet administered are returned to the item pool, the ability estimate is updated and the procedure is repeated. An asset of this method is that it can deal with a large range of constraints as long as there is a computer algorithm available. Hence, an item-ineligibility constraint can be implemented that reduces the exposure of overexposed items, that has a positive effect on the exposure rate of underexposed items (Van der Linden & Chang 2003; Van der Linden & Veldkamp 2004) or that, in the light of item-based ITSs, has a positive effect on the exposure rate of items that have an exposure rate close to the one required for reliable item calibration.
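As an illustration of how random administration of uncalibrated items, continuous recalibration and an exposure threshold could be combined in an item-based ITS, a much simplified sketch is given below. The single-step difficulty update and the threshold of 30 administrations are assumptions made for this example; Makransky and Glas (2010) recalibrate the items with a full IRT estimation run rather than with such an incremental update.

```python
import math
import random

class OnlineCalibratedBank:
    """Item bank whose difficulties are (re)estimated during the learning sessions."""

    def __init__(self, n_items, min_exposures=30, step=0.3):
        self.difficulty = [0.0] * n_items   # provisional difficulty estimates (logits)
        self.exposures = [0] * n_items      # how often each item has been administered
        self.min_exposures = min_exposures  # exposure threshold for adaptive eligibility
        self.step = step                    # size of the incremental update

    def eligible(self, i):
        """An item enters adaptive selection only after enough administrations."""
        return self.exposures[i] >= self.min_exposures

    def select(self, ability):
        """Adaptive selection among eligible items, random selection otherwise."""
        eligible = [i for i in range(len(self.difficulty)) if self.eligible(i)]
        if eligible:
            return min(eligible, key=lambda i: abs(self.difficulty[i] - ability))
        return random.randrange(len(self.difficulty))

    def update(self, i, ability, correct):
        """Shift the difficulty estimate against the residual of the observed response."""
        expected = 1.0 / (1.0 + math.exp(-(ability - self.difficulty[i])))
        self.difficulty[i] -= self.step * ((1.0 if correct else 0.0) - expected)
        self.exposures[i] += 1
```

In a real system, the item-ineligibility constraints discussed above could additionally raise the selection probability of items that are close to the exposure threshold.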

Skipped items

Besides the large amount and pattern of missing values, intentionally skipped items may pose problems. In Franel, learners are not required to answer each selected item. The decision to skip an item may depend on the learner's knowledge level and on the item difficulty, making the missing data not missing at random or missing completely at random. Research has indicated that treating the omitted responses as wrong is not appropriate (Lord 1983; Ludlow & O'Leary 1999; De Ayala et al. 2001). Goegebeur et al. (2006) proposed an IRT model for dealing with this kind of omitted responses, more specifically responses missing not at random (MNAR), in a testing environment. This IRT model models the missing observations simultaneously with the observed responses by treating the data as if there were three response categories: correct, incorrect and no response. The model also accounts for test speed. Goegebeur et al. (2006) applied this model to the Chilean Sistema de Medición de la Calidad de la Educación mathematics test data set, and the results show that the parameter estimates are improved with this model compared with the estimates based on an extended one-parameter logistic model of complete profiles. However, the authors emphasize that caution is in order, since the results can be strongly affected by the mechanism underlying the MNAR framework.
The challenge of the algorithm

Next to the challenge of the data set, the IRT application in learning environments is confronted with a second challenge, namely the challenge of the algorithm.
This challenge is partly derived from the difference in the data-gathering procedure between learning and testing environments and partly from the difference in the objective of learning and testing environments. We divide the challenge of the algorithm into three subproblems, namely the problem of estimating the item difficulty level, the problem of estimating the learner's ability level and the problem of defining the item selection algorithm.

Item difficulty estimation

Because of the different problems facing IRT-based item calibration in learning environments that arise from the data-gathering procedure, more specifically the missing values and the skipped items, other approaches to estimate the item difficulty level have been proposed.

A simple approach to estimate item difficulty that is not directly related to the IRT model is the number of learners who have answered the item correctly divided by the number of learners who have answered the item. This proportion of correct answers has the benefit that it is not based on a prior study but can be calculated online. The lower the proportion of persons who answer an item correctly, the more difficult the item is. Johns et al. (2006) have compared the item difficulty levels obtained by training an IRT model with the proportion of correct answers. Even though the correlation between those two measures of item difficulty across 70 items was relatively high (r = +0.68), the proportion of correct answers is subject to the knowledge level of the learners who have answered that item and to the number of learners who have answered that item. Hence, although attractive due to its simplicity, the approach throws away an important strength of IRT: the proportion of correct answers for an item depends on the sample of persons who answered the item, and therefore these proportions are only comparable over items if the groups of persons are comparable.
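For item \(i\), the proportion correct is simply the ratio below; the logit transformation next to it, which places the proportion on the same scale as the Rasch difficulty under the simplifying assumption that the responding learners have an average ability of zero, is a common rule of thumb and is not part of the studies cited above:

\[
p_i \;=\; \frac{c_i}{n_i},
\qquad
\hat{b}_i \;\approx\; \ln\frac{1 - p_i}{p_i},
\]

where \(c_i\) is the number of correct responses to item \(i\) and \(n_i\) is the number of learners who answered it.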

Another method that has been used to estimate item difficulty in AHs is by means of the learners' feedback on simple questions after each item (e.g. Chen et al. 2005, 2006; Chen & Duh 2008), such as 'Do you understand the content of the recommended course materials?' After a learner has given feedback, the scores are aggregated with those of other learners who previously answered this question. The new difficulty level of the course material is based on a weighted linear combination of the course difficulty as defined by course experts and the course difficulty determined from the collaborative feedback of the learners. The difficulty parameters slowly approach a steady value as the number of learners increases.

Another approach to obtain item parameter estimates is allowing subject domain experts to estimate the value of the difficulty parameter (Yao 1991; Linacre 2000; Fernandez 2003; Lu et al. 2007). There is some evidence in the measurement literature that test specialists are capable of estimating item difficulties with reasonable accuracy (e.g. Chalifour & Powers 1989), although other studies found contradictory results (Hambleton et al. 1998).

Another method that has been used in testing environments to estimate the item difficulty level of new items is the paired comparison method (Ozaki & Toyoda 2006, 2009). In this method, items for which the difficulty parameter has to be estimated are compared with multiple items of which the item difficulty parameter is known. This is performed sequentially or simultaneously, i.e. a one-to-one comparison or a one-to-many comparison, respectively.

Learner feedback, expert feedback and paired comparison have two common limitations. First, these three alternative estimation approaches are less applicable for the calibration of an entire, large, already existing item bank, since asking the learner for feedback after each item administration requires both considerable time and mental effort, possibly interrupting the learning process. However, these estimation approaches seem to be viable solutions for estimating the item difficulty level of newly added items. Second, these estimation approaches are more subjective than IRT-based calibration.

Ability estimation

The objective of adaptive learning environments is to foster efficient learning by providing learners with a personalized course/item sequence. Obtaining the learner's ability estimate at an early stage of the learning process and following the learning curve adequately is required in order for adaptive learning environments to be efficient. Hence, the problem of the ability estimation is twofold. On the one hand, we need to estimate the learner's ability level when little information is provided, which is referred to as the cold start problem. On the other hand, we need to be able to follow the progression of the learner.

The cold start problem refers to the situation where a learner starts working in the learning environment. After having answered only a few items, the ability estimate is not very accurate, reducing the efficiency of adaptive item selection. Therefore, two partial solutions can be provided. On the one hand, it can be feasible to use personal information, such as previous education, occupation, age and gender, to obtain and improve the learner's ability estimate. This technique can be extended by assigning weights to these criteria to model their relative importance (Masthoff 2004). On the other hand, it is advisable to apply other item selection methods than the one based on maximizing the Fisher information at the current estimated ability level, because using Fisher information when the estimated ability level is not close to the true ability level could be less efficient than assumed. A potential solution to this problem is modifying the item selection algorithm by taking into consideration the uncertainty of the estimated ability level. Several modified item selection algorithms have been proposed: Fisher interval information (Veerkamp & Berger 1997), Fisher information with a posterior distribution (Van der Linden 1998), Kullback–Leibler information (Chang & Ying 1996) and Kullback–Leibler information with a posterior distribution (Chang & Ying 1996). For a comparison of these item selection algorithms, see Chen et al. (2000).

The problem of the evolving ability level arises because learning is likely to take place on the basis of hints and feedback. Besides the change in the ability level within a learning session, the ability of the learner might increase between two successive learning sessions, for instance due to learning in other courses they follow, or might decrease, for instance due to forgetting. This monitoring of the learner's progress is an important research focus in the field of educational measurement. One method is progress testing. In progress testing, tests are frequently administered to allow for a quick intervention when atypical growth patterns are observed. However, a major drawback of this approach is that it requires regular ability assessments that should be long enough to be accurate, making it less appropriate for learning environments. Another method to model the learner's ability progress is updating the learner's knowledge level after each item administration, as is the case in the Elo rating system (Elo 1978), which was recently implemented in the educational field (Brinkhuis & Maris 2009). In item-based ITSs, the Elo rating system can be seen as an instance of paired comparison where the learner is seen as a player and the item is seen as its opponent. In the formula of Brinkhuis and Maris (2009), the new ability rating after an item administration is a function of the pre-administration rating, a weight given to the new observation and the difference between the actual score on the new observation and the expected score on the new observation. This expected score is calculated by means of the Rasch model. This means that when the difference between the expected score and the observed score is high, the change in the ability estimate is high. A merit of this algorithm is that it makes it possible to quickly follow changing knowledge levels and to rapidly perceive an atypical growth pattern.
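Written out in the notation of the Rasch model, this updating rule takes the following form (a restatement of the verbal description above, with \(K\) denoting the weight given to the new observation):

\[
\hat{\theta}_{\text{new}} \;=\; \hat{\theta}_{\text{old}} + K\,\bigl(x - P(x = 1)\bigr),
\qquad
P(x = 1) \;=\; \frac{\exp\!\bigl(\hat{\theta}_{\text{old}} - b\bigr)}{1 + \exp\!\bigl(\hat{\theta}_{\text{old}} - b\bigr)},
\]

where \(x \in \{0, 1\}\) is the observed response and \(b\) is the difficulty of the administered item. In Elo-type systems, an analogous update with opposite sign is often applied to the item difficulty estimate as well.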

Item selection algorithm

The difference in objectives between a testing and a learning environment also asks for a revision of the item selection algorithm as it is implemented in testing environments. The objective of a testing environment is to measure the person's ability level as precisely as possible. It can be shown that for the Rasch model, items with a success probability of 50% are optimally informative. However, the administration of items for which the person has a success rate of only 50% can decrease the person's motivation (Andrich 1995) and increase test anxiety (Rocklin & Thompson 1985), possibly resulting in an underestimation of the person's ability (Betz & Weiss 1976; Rocklin & Thompson 1985). Administering easier items in order to increase motivation and decrease test anxiety is non-optimal from a psychometric point of view. However, the impact on the measurement precision and test length is modest (Bergstorm et al. 1992; Eggen & Verschoor 2006). Another method applied in CAT that results in a decrease in test anxiety and an increase in motivation and performance is allowing the person to select the item difficulty level of the next item from among a number of difficulty levels (Wise et al. 1992; Vispoel & Coffman 1994; Roos et al. 1997). Although most persons choose a difficulty level that lies close to their estimated ability level (Johnson et al. 1991; Wise et al. 1992), a problem with these self-adapted tests remains that a few persons choose a difficulty level that is not well matched to their ability, resulting in a flaw in the psychometric demands.

In contrast, the objective of a learning environment is to optimize learning efficiency. The effect of motivation herein is even of greater importance than in testing, and especially than in high-stakes testing. Unmotivated learners will likely stop using the learning environment. Hence, especially for learning environments, items for which the person has a success probability above 50% should be considered. Next to increasing the success probability in order to increase motivation, it is possible to give the learner some control over the item difficulty level, comparable to self-adapted tests. Besides motivation, the learning outcome should also be kept optimal. Research results regarding the effect of item difficulty on learning progress are not conclusive. Some studies have indicated that learners learn more from easy items (Pavlik & Anderson 2008), while some theories suggest that learners learn more from items that are slightly more difficult given the learner's ability level (Hootsen et al. 2007), and yet other studies have indicated that learners learn as much from easy groups of learning opportunities as from difficult ones (Feng et al. 2008).
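Under the Rasch model, aiming for a success probability above 50% simply amounts to selecting items whose difficulty lies a fixed distance below the current ability estimate; for a target success probability \(p^{\ast}\) the required offset is given below (the value \(p^{\ast} = 0.7\) is only an illustrative choice, not a recommendation from the studies cited above):

\[
P(x = 1) = p^{\ast}
\;\Longleftrightarrow\;
b \;=\; \hat{\theta} - \ln\frac{p^{\ast}}{1 - p^{\ast}},
\]

so that, for example, a target of \(p^{\ast} = 0.7\) corresponds to items roughly 0.85 logits easier than the learner's estimated ability.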
Discussion and conclusion

Because IRT has been a prevalent approach for adaptive item sequencing in testing environments, the question tackled within this paper is whether IRT is suitable for adaptive item sequencing in item-based ITSs. Based on our own experiences with item-based ITSs and on ideas borrowed from testing environments and other types of learning environments, we identified some challenges in applying IRT for adaptive item sequencing in item-based ITSs and made some suggestions to handle these challenges (see Table 1).

A first main challenge that is brought up is that the structure of the data set and the extent of missing values can make item difficulty estimation problematic. Solutions that are suggested include the implementation of a calibration design, such as a common items design and a block-interlaced anchoring design. The former has the advantage that neither the equivalence of learners nor the equivalence of items needs to be assumed. However, the common items are administered more frequently and are therefore more reliably estimated than the non-common items. The latter has the same advantages as the common items design without its disadvantage.
Because learners seem to like some control over the learning system (e.g. selecting the chapter), a block-interlaced anchoring design seems to be preferable in item-based ITSs, as it provides both the course creator and the learner with some control over the item administration.

Furthermore, an online calibration method to evolve from random item administration to adaptive item administration is suggested. The advantage of this continuous updating strategy is that items can easily be added to the learning environment without the requirement of prior calibration, while keeping the measurement error of the learner's knowledge level low. Finally, an item exposure control algorithm can be specified that makes items for which the item exposure rate is close to the one required for reliable item difficulty estimation eligible for adaptive item administration. This item exposure control method, which imposes item-ineligibility constraints, has the benefit of reducing the time needed for item calibration.

Combining the proposed solutions might also be considered. For example, when we want to measure a learner's ability level while calibrating an item bank, it might be reasonable to combine a continuous updating calibration strategy with an item exposure control algorithm. In such situations, a part of the learning session is composed of randomly selected items with an unknown difficulty level. In another part, items are selected that have already been administered a certain number of times but whose difficulty level cannot yet be reliably estimated. Items are selected that are closest to being reliably calibrated, and the item difficulty parameter estimate is updated after each item administration. In a last part, adaptive item selection can be applied using items for which the difficulty level is already sufficiently known. A difficulty is that we do not know in advance how many items a learner will complete in one learning session.

A second main challenge, largely intertwined with the challenge of the data set, concerns the implemented algorithms, which comprise the item difficulty estimation, the estimation of the learner's ability level and the item selection algorithm. The alternative applicable techniques for item difficulty estimation that are proposed in this paper are proportion correct, learners' feedback, expert rating and paired comparison. Compared with IRT-based calibration, these alternative techniques are more sample-dependent (proportion correct) or more subjective because persons have to make a decision (learners' feedback, expert rating and paired comparison).

Table 1. Challenges and proposed solutions of applying IRT for adaptive item sequencing in item-based ITSs.

Challenge: Data set

Problem: Missing values – structure of data
  Proposed solution: Common items design
    Advantages: No assumption needed on the equivalence of learners and items
    Disadvantages: Non-equal administration of items
    Applications: The calibration of an item bank
    Reference: Eggen 1993
  Proposed solution: Block-interlaced anchoring design
    Advantages: No assumption needed on the equivalence of learners and items; more equal administration of all items
    Applications: The calibration of an item bank
    Reference: Eggen 1993

Problem: Missing values – amount of data
  Proposed solution: Continuous updating strategy
    Advantages: Easiness in adding items to the item pool without prior calibration; low measurement error in ability estimation
    Disadvantages: Length of learning session is preferably known
    Applications: The calibration of an item bank
    Reference: Makransky and Glas 2010
  Proposed solution: Item-ineligibility constraints on the assembly of shadow tests
    Advantages: Faster calibration
    Applications: The calibration of an item bank
    Reference: Van der Linden & Veldkamp 2004, 2007

Problem: Skipped items
  Proposed solution: Simultaneous modelling of the missing observations with the observed responses
    Advantages: More accurate parameter estimates
    Disadvantages: More complex model and hence more data required for estimation
    Applications: Estimation of the learner's ability level
    Reference: Goegebeur et al. 2006

Challenge: Algorithm

Problem: Item difficulty
  Proposed solution: Proportion correct
    Advantages: Always estimable
    Disadvantages: Subject to the sample of learners used
    Applications: Estimation of the item difficulty level when item calibration is not possible; prior in Bayesian estimation algorithm
    Reference: Johns et al. 2006
  Proposed solution: Learners' feedback
    Advantages: Always estimable
    Disadvantages: Interrupts the learning process; time-consuming
    Applications: Estimation of the item difficulty level of new items; prior in Bayesian estimation algorithm
    Reference: Chen et al. 2005
  Proposed solution: Experts' feedback
    Advantages: Always estimable
    Disadvantages: Time-consuming
    Applications: Estimation of the item difficulty level of new items; prior in Bayesian estimation algorithm
    Reference: Linacre 2000
  Proposed solution: Paired comparison
    Advantages: Always estimable
    Disadvantages: Interrupts the learning process; time-consuming
    Applications: Estimation of the item difficulty level of new items; prior in Bayesian estimation algorithm
    Reference: Ozaki & Toyoda 2006, 2009

Problem: Ability level
  Proposed solution: Weighted adaptation based on background characteristics
    Advantages: Indication of the learner's ability level
    Disadvantages: Loss of some information
    Applications: Ability level estimation on the basis of few responses; in combination with Elo rating system
    Reference: Masthoff 2004
  Proposed solution: Alternative item selection algorithm
    Advantages: More efficient than Fisher information when the ability level is unknown
    Disadvantages: More radical than weighted adaptation based on background characteristics
    Applications: Ability level estimation on the basis of few responses
    Reference: Chen et al. 2000
  Proposed solution: Elo rating systems
    Advantages: Perception of atypical learning patterns; easily extendible
    Applications: Tracking of ability level in item-based learning environments
    Reference: Brinkhuis & Maris 2009
Besides, learners' feedback, expert rating and paired comparison are less applicable for the calibration of an entire item bank, as they are time-consuming and can interrupt the learning process. Nevertheless, all the mentioned alternative estimation methods can be applied for item difficulty estimation of new items. Furthermore, we can implement the item difficulty values that are obtained with one of those alternative estimation methods into the IRT-based estimation algorithm, yielding a faster acquisition of reliable item difficulty estimates (Swaminathan et al. 2003).

The algorithm to estimate the learner's ability level faces two problems: the cold start problem and the problem of the evolving ability level. The former problem can be addressed by weighted adaptation based on background characteristics and by a modified item selection algorithm. The latter problem can be addressed by implementing the Elo rating system to track the learner's ability. The advantages of the Elo rating system are that it can quickly follow the change in knowledge level and that it can easily be extended. An example of this last advantage is the incorporation of background characteristics. When the learner starts working in the learning environment, a realistic starting value of the learner's knowledge level based on background characteristics can enhance the estimation process in the Elo rating system.

The problem of the item selection algorithm broaches the question of which item selection criterion should be implemented and of whether learners should have some control over it. The first question focuses on whether to select items that maximize the measurement precision, thereby leaving the learner with a probability of 50% of answering an item correctly, or whether to select easier or more difficult items. The answer to this question should be explored in more detail in further experimental research. The second question concerns the learner's control over the item selection algorithm. More specifically, should the learner be provided the freedom to choose the item difficulty level of the next item or not? Studies in CAT have shown that giving a person control leads to higher motivation. However, because persons do not always select the item difficulty that matches their current ability estimate, the measurement precision is biased. We could combine this research result with the research results regarding the modest impact of using psychometrically non-optimal items on the measurement precision. This could yield an item-based ITS where learners have the possibility to select the item difficulty level of the next item out of a specific item difficulty range.
This difficulty range does not have a large impact on the measurement precision. In such a situation, the learners perceive that they have some control over the learning environment, which might lead to higher motivation.

We can conclude that IRT is potentially a valuable method to adapt the item sequencing in item-based ITSs to the learner's knowledge level. Applying IRT to learning environments, however, is not straightforward. In this paper, we wanted to identify the associated problems and to suggest solutions based on the literature on testing and learning environments, whether they were IRT-based or not. Further research in which these ideas are empirically evaluated is required.
References
Andrich D. (1995) Review of the book Computerized Adaptive Testing: A Primer. Psychometrika 4, 615–620.
Baylari A. & Montazer G.A. (2009) Design a personalized e-learning system based on item response theory and artificial neural network approach. Expert Systems with Applications 36, 8013–8021.
Bergstorm B.A., Lunz M.E. & Gershon R.C. (1992) Altering the level of difficulty in computer adaptive testing. Applied Measurement in Education 5, 137–149.
Betz N.E. & Weiss D.J. (1976) Psychological Effect of Immediate Knowledge of Results and Adaptive Ability Testing (Rep. No. 76-4). University of Minnesota, Department of Psychology, Psychometric Methods Program, Minneapolis, MN.
Bloom B.S. (1984) The 2-sigma problem: the search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher 13, 4–16.
Brinkhuis M.J.S. & Maris G. (2009) Dynamic Parameter Estimation in Student Monitoring Systems. Measurement and Research Department Reports (Rep. No. 2009-1). Cito, Arnhem.
Brusilovsky P. (1999) Adaptive and intelligent technologies for Web-based education. Künstliche Intelligenz 4, 19–25.
Brusilovsky P. & Peylo C. (2003) Adaptive and intelligent Web-based educational systems. International Journal of Artificial Intelligence in Education 13, 156–169.
Chalifour C.L. & Powers D.E. (1989) The relationship of content characteristics of GRE analytical reasoning items to their difficulties and discriminations. Journal of Educational Measurement 26, 120–132.
Chang H.H. & Ying Z.L. (1996) A global information approach to computerized adaptive testing. Applied Psychological Measurement 20, 213–229.
Chen C.M. & Chen M.C. (2009) Mobile formative assessment tool based on data mining techniques for supporting Web-based learning. Computers & Education 52, 256–273.
Chen C.M. & Chung C.J. (2008) Personalized mobile English vocabulary learning system based on item response theory and learning memory cycle. Computers & Education 51, 624–645.
Chen C.M. & Duh L.J. (2008) Personalized Web-based tutoring system based on fuzzy item response theory. Expert Systems with Applications 34, 2298–2315.
Chen C.M., Hsieh Y.L. & Hsu S.H. (2007) Mining learner profile utilizing association rule for Web-based learning diagnosis. Expert Systems with Applications 33, 6–22.
Chen C.M. & Hsu S.H. (2008) Personalized intelligent mobile learning system for supporting effective English learning. Educational Technology & Society 11, 153–180.
Chen C.M., Lee H.M. & Chen Y.H. (2005) Personalized e-learning system using item response theory. Computers & Education 44, 237–255.
Chen C.M., Liu C.Y. & Chang M.H. (2006) Personalized curriculum sequencing utilizing modified item response theory for Web-based instruction. Expert Systems with Applications 30, 378–396.
Chen S.Y., Ankenmann R.D. & Chang H.H. (2000) A comparison of item selection rules at the early stages of computerized adaptive testing. Applied Psychological Measurement 24, 241–255.
Collins J.A., Greer J.E. & Huang S.X. (1996) Adaptive assessment using granularity hierarchies and Bayesian nets. In Proceedings of the Third International Conference on Intelligent Tutoring Systems (eds C. Frasson, G. Gauthier & A. Lesgold), pp. 569–577. Springer-Verlag, London.
Conati C., Gertner A. & Vanlehn K. (2002) Using Bayesian networks to manage uncertainty in student modeling. User Modeling and User-Adapted Interaction 12, 371–417.
Conejo R., Guzmán E., Millán M.T.E., Trella M., Pérez-de-la-Cruz J.L. & Ríos A. (2004) SIETTE: a Web-based tool for adaptive testing. International Journal of Artificial Intelligence in Education 14, 29–61.
De Ayala R.J., Plake B.S. & Impara J.C. (2001) The impact of omitted responses on the accuracy of ability estimation in item response theory. Journal of Educational Measurement 38, 213–234.
Desmet P. (2006) L'apprentissage/enseignement des langues à l'ère du numérique: tendances récentes et défis. Revue Française de Linguistique Appliquée 11, 119–138.
Eggen T.J.H.M. (1993) Itemresponstheorie en onvolledige gegevens. In Psychometrie in de Praktijk (eds T.J.H.M. Eggen & P.F. Sanders), pp. 239–284. Cito, Arnhem.
Eggen T.J.H.M. & Verschoor A.J. (2006) Optimal testing with easy or difficult items in computerized adaptive testing. Applied Psychological Measurement 30, 379–393.
Elo A.E. (1978) The Rating of Chess Players, Past and Present. B.T. Batsford Ltd., London.
Feng M., Heffernan N., Beck J.E. & Koedinger K. (2008) Can we predict which groups of questions students will learn from? In Proceedings of the 1st International Conference on Educational Data Mining (eds R.S.J.D. Baker & J.E. Beck), pp. 218–225. Retrieved from http://www.educationaldatamining.org/EDM2008/uploads/proc/full%20proceedings.pdf
Fernandez G. (2003) Cognitive scaffolding for a Web-based adaptive learning environment. In Advances in Web-Based Learning, Lecture Notes in Computer Science 2783 (eds G. Goos, J. Hartmanis & J. van Leeuwen), pp. 12–20. Springer, Berlin.
Georgiadou E., Triantafillou E. & Economides A.A. (2007) A review of item exposure control strategies for computerized adaptive testing developed from 1983 to 2005. Journal of Technology, Learning and Assessment 5, 4–38.
Goegebeur Y., De Boeck P., Molenberghs G. & Pino G. (2006) A local-influence-based diagnostic approach to a speeded item response theory model. Journal of the Royal Statistical Society Series C, Applied Statistics 55, 647–676.
Guzmán E. & Conejo R. (2005) Self-assessment in a feasible, adaptive Web-based testing system. IEEE Transactions on Education 48, 688–695.
Hambleton R.K., Bastari B. & Xing D. (1998) Estimating Item Statistics. Laboratory of Psychometric and Evaluative Research (Rep. No. 298). University of Massachusetts, School of Education, Amherst, MA.
Hootsen G., van der Werf R. & Vermeer A. (2007) E-learning op maat: automatische geïndividualiseerde materiaalselectie in het tweede-taalonderwijs. Toegepaste Taalwetenschap in Artikelen 78, 119–130.
Huang S.X. (1996) A content-balanced adaptive testing algorithm for computer-based training systems. In Intelligent Tutoring Systems, Lecture Notes in Computer Science (eds C. Frasson, G. Gauthier & A. Lesgold), pp. 306–314. Springer, Heidelberg.
Johns J., Mahadevan S. & Woolf B. (2006) Estimating student proficiency using an item response theory model. Intelligent Tutoring Systems, Lecture Notes in Computer Science 4053, 473–480.
Johnson P.L., Roos L.L., Wise S.L. & Plake B.S. (1991) Correlates of examinee item choice behavior in self-adapted testing. Mid-Western Educational Researcher 4, 25–29.
Kelly D. & Tangney B. (2006) Adapting to intelligence profile in an adaptive educational system. Interacting with Computers 18, 385–409.
Lee M.G. (2001) Profiling students' adaptation styles in Web-based learning. Computers & Education 36, 121–132.
Leung E.W.C. & Li Q. (2007) An experimental study of a personalized learning environment through open-source software tools. IEEE Transactions on Education 50, 331–337.
Linacre J.M. (2000) Computer-adaptive testing: a methodology whose time has come. In Development of Computerized Middle School Achievement Test (eds S. Chae, U. Kang, E. Jeon & J.M. Linacre), pp. 3–58. Komesa Press, Seoul.
Lord F.M. (1970) Some test theory for tailored testing. In Computer-Assisted Instruction, Testing, and Guidance (ed. W.H. Holtzman), pp. 139–183. Harper & Row, New York.
Lord F.M. (1983) Maximum-likelihood estimation of item response parameters when some responses are omitted. Psychometrika 48, 477–482.
Lu F., Li X., Liu Q.T., Yang Z.K., Tan G.X. & He T.T. (2007) Research on personalized e-learning system using fuzzy set based clustering algorithm. Computational Science – ICCS 2007 4489, 587–590.
Ludlow L.H. & O'Leary M. (1999) Scoring omitted and not-reached items: practical data analysis implications. Educational and Psychological Measurement 59, 615–630.
Makransky G. & Glas C.A.W. (2010) Bootstrapping an item bank: an automatic online calibration design in adaptive testing. Journal of Applied Testing Technology (in press).
Masthoff J. (2004) Group modeling: selecting a sequence of television items to suit a group of viewers. User Modeling and User-Adapted Interaction 14, 37–85.
Millán E., Garcia-Hervas E., Riscos E.G.D., Rueda A. & Perez-de-la-Cruz J.L. (2003) TAPLI: an adaptive Web-based learning environment for linear programming. Current Topics in Artificial Intelligence 3040, 676–685.
Millán E., Perez-de-la-Cruz J.L. & Suarez E. (2000) Adaptive Bayesian networks for multilevel student modeling. Intelligent Tutoring Systems, Proceedings 1839, 534–543.
Ozaki K. & Toyoda H. (2006) Paired comparison IRT model by 3-value judgment: estimation of item parameters prior to the administration of the test. Behaviormetrika 33, 131–147.
Ozaki K. & Toyoda H. (2009) Item difficulty parameter estimation using the idea of the graded response model and computerized adaptive testing. Japanese Psychological Research 51, 1–12.
Paramythis A. & Loidl-Reisinger S. (2004) Adaptive learning environment and e-learning standards. Electronic Journal of eLearning 2, 181–194.
Pavlik P.I. & Anderson J.R. (2008) Using a model to compute the optimal schedule of practice. Journal of Experimental Psychology: Applied 14, 101–117.
Pérez-Marín D., Alfonseca E. & Rodriguez P. (2006) On the dynamic adaptation of computer assisted assessment of free-text answers. In Adaptive Hypermedia and Adaptive Web-Based Systems, Lecture Notes in Computer Science 4018 (eds V. Wade, H. Ashman & B. Smith), pp. 374–377. Springer, Berlin.
Rasch G. (1960) Probabilistic Models for Some Intelligence and Attainment Tests. Institute of Educational Research, Copenhagen.
Razzaq L., Feng M., Nuzzo-Jones G., Heffernan N.T., Koedinger K.R., Junker B., Ritter S., Knight A., Aniszczyk C., Choksey S., Livak T., Mercado E., Turner T.E., Upalekar R., Walonoski J.A., Macasek M.A. & Rasmussen K.P. (2005) The Assistment project: blending assessment and assisting. In Proceedings of the 12th International Conference on Artificial Intelligence in Education (eds C.K. Looi, G. McCalla, B. Bredeweg & J. Breuker), pp. 555–562. IOS Press, Amsterdam.
Rocklin T. & Thompson J.M. (1985) Interactive effects of test anxiety, test difficulty, and feedback. Journal of Educational Psychology 77, 368–372.
Roos L.L., Wise S.L. & Plake B.S. (1997) The role of item feedback in self-adapted testing. Educational and Psychological Measurement 57, 85–98.
Swaminathan H., Hambleton R.K., Sireci S.G., Xing D.H. & Rizavi S.M. (2003) Small sample estimation in dichotomous item response models: effect of priors based on judgmental information on the accuracy of item parameter estimates. Applied Psychological Measurement 27, 27–51.
Tai D.W.-S., Tsai T.-A. & Chen F.M.-C. (2001) Performance study on learning Chinese keyboarding skills using the adaptive learning system. Global Journal of Engineering Education 5, 153–161.
Van der Linden W.J. (1998) Bayesian item selection criteria for adaptive testing. Psychometrika 63, 201–216.
Van der Linden W.J. & Chang H.H. (2003) Implementing content constraints in alpha-stratified adaptive testing using a shadow test approach. Applied Psychological Measurement 27, 107–120.
Van der Linden W.J. & Glas C.A.W. (2000) Computerized Adaptive Testing: Theory and Practice. Kluwer, Norwell, MA.
Van der Linden W.J. & Hambleton R.K. (1997) Handbook of Modern Item Response Theory. Springer, New York.
Van der Linden W.J. & Veldkamp B.P. (2004) Constraining item exposure in computerized adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics 29, 273–291.
Van der Linden W.J. & Veldkamp B.P. (2007) Conditional item-exposure control in adaptive testing using item-ineligibility probabilities. Journal of Educational and Behavioral Statistics 32, 398–418.
Veerkamp W.J.J. & Berger M.P.F. (1997) Some new item selection criteria for adaptive testing. Journal of Educational and Behavioral Statistics 22, 203–226.
Verdú E., Regueras L.M., Verdú M.J., De Castro J.P. & Pérez M.A. (2008) An analysis of the research on adaptive learning: the next generation of e-learning. WSEAS Transactions on Information Science and Applications 5, 859–868.
Vispoel W.P. & Coffman D.D. (1994) Computerized-adaptive and self-adapted music-listening tests: psychometric features and motivational benefits. Applied Measurement in Education 7, 25–51.
Wainer H. (2000) Computerized Adaptive Testing: A Primer. Erlbaum, London.
Wainer H. & Mislevy R.J. (2000) Item response theory, item calibration, and proficiency estimation. In Computerized Adaptive Testing: A Primer (ed. H. Wainer), pp. 61–100. Erlbaum, London.
Wasson B. (1993) Automating the development of intelligent learning environments: a perspective on implementation issues. In Automated Instructional Design, Development and Delivery (ed. R.D. Tennyson), pp. 153–170. Springer, Berlin.
Wise S.L., Plake B.S., Johnson P.L. & Roos L.L. (1992) A comparison of self-adapted and computerized adaptive tests. Journal of Educational Measurement 29, 329–339.
Yao T. (1991) CAT with a poorly calibrated item bank. Rasch Measurement Transactions 5, 141.
