
Spaan Fellow Working Papers in Second or Foreign Language Assessment

Copyright 2010, Volume 8: 69–94

English Language Institute University of Michigan www.lsa.umich.edu/eli/research/spaan

Expanding a Second Language Speaking Rating Scale for Instructional and Assessment Purposes
Kornwipa Poonpon Northern Arizona University

Abstract

Recently mandated oral language assessments in English language classrooms have prompted teachers to use published rating scales to assess their students' speaking ability. The application of these published rating scales can cause problems because they are often so broad that they cannot capture the students' improvement during a course. Bridging the gap between the teachers' testing world and the scholars' testing theories and practices, this study aimed to expand the TOEFL iBT's integrated speaking rating scale in order to include more fine-grained distinctions to capture progress being made by students. A sample of 119 spoken responses from two integrated tasks was analyzed to identify salient response features in delivery, language use, and topic development from (a) acoustic-based and corpus-based perspectives, and (b) raters' perspectives. The acoustic-based results revealed that speech rate and content-planning pauses were significant predictors of delivery performance. The corpus-based results revealed type/token ratio, proportion of low- and high-frequency words, error-free c-units, stance adverbs, prepositional phrases, relative clauses, and passives as significant predictors of language use performance. Linking adverbials were found to be a predictor of topic development performance. These salient features, in combination with the features verbally reported by the raters, were used to inform the expansion of the speaking rating scale. Guided by empirically derived, binary-choice, boundary-definition (EBB) scales (Upshur & Turner, 1995), score levels 2 and 3 in the speaking scale were then expanded to describe features of responses slightly below and above each original level. These detailed descriptors were presented as a rating guide made up of a series of choices, resulting in an expansion of scores to 1, 1.5, 2, 2.5, 3, 3.5, and 4 within delivery, language use, and topic development.

Communicative approaches to language testing have led to the widespread use of oral language assessments in English as a second/foreign language (ESL/EFL) classroom contexts. The inclusion of oral language assessments in ESL/EFL classrooms seems to result in teachers' use of published rating scales to assess their students' speaking ability. Application of published rating scales might cause problems because such scales are often too broad to capture students' actual ability or their language progress over a course (Upshur & Turner, 1995).

As one of the most popular language tests, the Test of English as a Foreign Language Internet-based test (TOEFL iBT) has generated challenges in ESL/EFL teaching and testing because of its inclusion of the assessment of integrated speaking skills (i.e., test takers' ability to use spoken and/or written input to produce a spoken response). With the new task types and new rating scales, it is challenging for scale developers as well as score users to ensure that test scores derived from the test will be appropriately interpreted. In particular, TOEFL iBT scores should inform score users about (a) how well the test takers are likely to perform on speaking tasks across university settings and (b) test takers' weaknesses or strengths in communication for academic purposes. These score interpretations address two components of language assessment: score generalization and score utilization.

The extent to which the scores can be generalized is associated with the consistency of ratings. Research on the scoring of the TOEFL iBT's speaking section has found that some raters have difficulty using the rating scales (Brown, Iwashita, & McNamara, 2005; Xi & Mollaun, 2006). In particular, raters cannot clearly distinguish among the three analytic dimensions (i.e., delivery, language use, and topic development) of the scale. Experienced raters may have a good feel for a general description category on the rating scale and can simply assign one single score. However, novice raters who are trying to attend to the three dimensions before giving a holistic score may encounter problems that result in inconsistent ratings (Xi & Mollaun, 2006) and thus low inter-rater reliability. Inconsistent ratings affect the generalization of the test scores.

Score utilization is related to an expectation that the TOEFL iBT speaking test will help guide curricula worldwide (e.g., Butler, Eignor, Jones, McNamara, & Suomi, 2000; Jamieson, Jones, Kirsch, Mosenthal, & Taylor, 2000; Wall & Horák, 2006, 2008). Language teachers are being encouraged to use the TOEFL iBT speaking rating scales in their classes to raise students' awareness of their abilities in relation to the TOEFL scales and to help them improve. However, the existing scales may not provide adequate guidance for teachers in many pedagogical contexts. While holistic scores have the advantage of being both practical and effective for admissions decisions, they have a disadvantage in classroom situations because they do not provide detailed feedback. Also, the TOEFL rating scales may simply be too broad to capture students' actual ability (Clark & Clifford, 1988; Fulcher, 1996b), and if used in classroom settings they may not sufficiently allow the teacher to indicate a student's language progress.

Purpose of the Study

The TOEFL iBT speaking rating scales may need some modifications for instructional and rater training purposes. In an attempt to make the scales and their use more practical, this study aimed to expand the existing TOEFL iBT integrated speaking scale from four score levels (i.e., 1, 2, 3, and 4) to seven levels (i.e., 1, 1.5, 2, 2.5, 3, 3.5, and 4) in order to include more fine-grained distinctions, to make rating easier and more consistent, and to capture progress being made by the speaker.
Driven by the literature on approaches to scale development (see also Bachman & Palmer, 1996; Bachman & Savignon, 1986; Dandonoli & Henning, 1990; Fulcher, 1987, 1997; Lowe, 1986; North, 2000; Stansfield & Kenyon, 1992), this study used an empirically based approach to expand the scale because it allowed both quantitative and qualitative data to be used to inform the scale expansion. Inspired by the work of Upshur and Turner (1999) and their empirically derived, binary-choice, boundary-definition (EBB) scales, detailed descriptors were proposed as a rating guide with a series of binary choices within the categories of delivery, language use, and topic development.


Research Questions

The study was guided by three research questions.
1. What spoken features can distinguish examinees' performances at the score levels 1, 1.5, 2, 2.5, 3, 3.5, and 4 for delivery, language use, and topic development?
2. What are the relationships between dimension scoring profiles and the ETS holistic scores?
3. Are the expanded score levels functioning appropriately for delivery, language use, and topic development?

Methodology

Participants

The study included ten English language teachers. All were considered novice speaking raters: they either had no formal training in how to use a speaking scale or had received no such training for at least three years prior to the study. Five were male and five were female, with ages between 23 and 44. Seven were native speakers of English and three were nonnative English speakers whose first languages were Italian, Chinese, and Arabic. Eight were PhD students in applied linguistics and two were master's degree students in TESL at an American university. All had at least three years of experience teaching English, including ESL/EFL, English for academic purposes, and English composition.

A Test-Based Spoken Corpus

The test-based spoken corpus was taken from a public-use data set provided by Educational Testing Service. The corpus consisted of 119 reading/listening/speaking (RLS) spoken responses to two tasks (Task 3 and Task 4). These responses represented two first-language groups: Arabic and Korean. They included 20 responses from score level 1 and 20 from score level 4 (10 from each language group at each level, 40 in total), 40 from score level 2 (20 from each language group), and 39 from score level 3 (only 19 available responses from Arabic and 20 from Korean). Because the focus of the study was on score levels 2 and 3, only a smaller number of responses from levels 1 and 4 was needed, to serve as the lower and upper boundaries of the scale. The corpus consisted of 13,570 total words.

Data Collection

Data collection was conducted during the summer of 2008. First, the raters were trained to use the TOEFL iBT integrated speaking rating scale by taking an ETS online scoring tutorial from the ETS Online Scoring Network (OSN) tutorial website (permission granted by ETS), focusing on the scoring training for RLS speaking tasks. The raters were then trained to use the expanded scores and to produce verbal protocols. Each of the 10 raters was to complete about 96 of the total 952 scorings (119 responses × 2 scorings × 4 dimensions). The responses were systematically arranged for double scoring across level and first-language group. The raters scored the responses independently at their convenience.

The raters completed two parts while scoring. In the first part, they both scored and completed think-aloud reports. They listened to the first set of responses and scored each holistically, using the expanded scores (i.e., 1, 1.5, 2, 2.5, 3, 3.5, or 4); they then recorded their verbal reports. The next set of responses was scored on delivery, with
each score followed by a verbal report. Then, a third set of responses was scored for language use, and each score was followed by a verbal report. The final set of responses was scored for topic development, again followed by verbal reports. In the second part, the raters only scored, following the same steps as in the first part but without giving verbal reports.

Data Analysis

The study employed a mixed-method approach to data analysis. Both quantitative data and qualitative data were analyzed to triangulate each other in order to answer the research questions. The quantitative analysis was conducted separately for each of the two RLS tasks (Tasks 3 and 4), as required by the assumption of independent observations in multiple regression.

Quantitative Data

Quantitative data analyses involved two sets of data: occurrences of linguistic features in the spoken responses and the raters' scores. The linguistic analyses were conducted to search for features, predicted in the literature, that distinguish performance across score levels for delivery, language use, and topic development (Table 1). For delivery, speech rate, filled pauses, and pause/hesitation phenomena were measured. Values for these variables served as independent variables that were regressed on the scores to establish their relative importance in accounting for variance. For language use, transcripts of the spoken responses were used to analyze vocabulary range and richness as well as grammatical accuracy. They were also automatically tagged for lexico-grammatical features, using Biber's (2001) tagging program, to examine specific features (i.e., complement clauses, adverbial clauses, relative clauses, and prepositional phrases) and data-driven features reflecting grammatical complexity. Means and standard deviations were calculated for occurring features. Features without counts were deleted from the list. For text comparison purposes, the counts of the tagged features were normalized to 100 words. Stepwise multiple regression was then computed to see which features influenced language use scores. The analysis for topic development was conducted through an investigation of cohesive devices as indicators of text coherence, using the concordancing software MonoConc. The framework of the analysis was based on Halliday and Hasan's (1976) lexico-grammatical model of cohesion, with some adjustments for clearer operational definitions of the conjunction and collocation relations. Stepwise multiple regression was computed to see which features influenced topic development scores.

The raters' scores were analyzed using both traditional and FACETS analyses. First, estimates of inter-rater reliability between the two trained raters were computed. In each dimension, discrepant responses were listened to by a third rater (the researcher) and the most frequent score was used. Descriptive statistics were computed for each dimension score and the holistic score to determine the degree to which the raters actually used the expanded scale. The FACETS analysis allowed the researcher to look at the meaningfulness of the expanded levels of the scale.
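To make the regression step concrete, the following sketch shows one way such an analysis could be set up. It is a minimal illustration rather than the analysis script used in the study: the DataFrame and its column names (e.g., speech_rate, content_pauses, delivery_score) are hypothetical, and a simple forward-selection loop with statsmodels is used to approximate stepwise multiple regression.

```python
# Minimal sketch of a forward-selection ("stepwise") regression on normalized
# feature counts. The DataFrame and column names are hypothetical, not the
# study's actual analysis script.
import pandas as pd
import statsmodels.api as sm

def forward_select(df: pd.DataFrame, outcome: str, candidates: list[str],
                   alpha: float = 0.05) -> list[str]:
    """Add one predictor at a time, keeping it only if its p-value is below alpha."""
    candidates = list(candidates)
    selected: list[str] = []
    improved = True
    while improved and candidates:
        improved = False
        pvals = {}
        for feat in candidates:
            X = sm.add_constant(df[selected + [feat]])
            pvals[feat] = sm.OLS(df[outcome], X).fit().pvalues[feat]
        best = min(pvals, key=pvals.get)
        if pvals[best] < alpha:
            selected.append(best)
            candidates.remove(best)
            improved = True
    return selected

# Hypothetical use for the delivery dimension on Task 3 data:
# predictors = forward_select(task3, "delivery_score",
#                             ["speech_rate", "filled_pauses", "content_pauses"])
# final = sm.OLS(task3["delivery_score"],
#                sm.add_constant(task3[predictors])).fit()
# print(final.rsquared, final.params)
```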


Table 1. Measures for Delivery, Language Use, and Topic Development

Delivery
1. Speech rate: number of syllables (excluding fillers) per real expressed time, normalized to 60 seconds
2. Filled pauses: number of filled pauses per real expressed time, normalized to 60 seconds
3. Content planning pauses: mean number of content planning pauses produced in a given speech per 60 seconds
4. Grammatical/lexical planning pauses: mean number of grammatical planning pauses produced in a given speech per 60 seconds
5. Adding examples/extra information pauses: mean number of addition pauses in a given speech per 60 seconds
6. Grammatical/lexical repair pauses: mean number of grammatical/lexical repair pauses in a given speech per 60 seconds
7. Propositional uncertainty pauses: mean number of propositional uncertainty pauses in a given speech per 60 seconds

Language Use
8. Vocabulary richness: proportion of high- and low-frequency words used in a spoken text
9. Vocabulary range: type/token ratio in a spoken text
10. Grammatical accuracy: mean proportion of error-free c-units in a spoken text
Specific grammatical complexity
11. Complement clauses: normalized counts of complement clauses in a spoken text*
12. Adverbial clauses: normalized counts of adverbial clauses in a spoken text*
13. Relative clauses: normalized counts of relative clauses in a spoken text*
14. Prepositional phrases: normalized counts of prepositional phrases in a spoken text*
15. Data-driven (grammatical complexity) features: normalized counts of lexico-grammatical features in a spoken text*

Topic Development
16. Reference devices for cohesion: normalized counts of reference devices (e.g., these, here, there) in a spoken text*
17. Conjunction devices for cohesion: normalized counts of linking adverbials (e.g., first, then, so, however) in a spoken text*

Note. * Counts were normalized to 100 words.
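The operationalizations in Table 1 can be illustrated with a short sketch. The code below is illustrative only: the tokenizer, the syllable count, and the high-frequency word list are simplified stand-ins for the instruments actually used in the study.

```python
# Illustrative computation of selected Table 1 measures. The tokenizer, syllable
# counts, and high-frequency word list are simplified stand-ins, not the study's tools.
import re

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def speech_rate(n_syllables: int, speaking_time_sec: float) -> float:
    """Measure 1: syllables per real expressed time, normalized to 60 seconds."""
    return n_syllables / speaking_time_sec * 60

def per_100_words(count: int, n_words: int) -> float:
    """Normalization used for the starred measures: counts per 100 words."""
    return count / n_words * 100

def type_token_ratio(words: list[str]) -> float:
    """Measure 9: vocabulary range as the type/token ratio."""
    return len(set(words)) / len(words)

def low_frequency_proportion(words: list[str], high_freq: set[str]) -> float:
    """Measure 8 (one reading): share of words falling outside a high-frequency list."""
    return sum(w not in high_freq for w in words) / len(words)

# Toy example:
# words = tokens(transcript)
# print(speech_rate(n_syllables=130, speaking_time_sec=55.0))  # syllables per minute
# print(per_100_words(count=9, n_words=len(words)))            # e.g., relative clauses
```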


Qualitative Data

Data derived from the think-aloud protocols were coded and analyzed by task, dimension, and score level, using Miles and Huberman's (1994) framework. The researcher and a second coder underlined pieces of information that fell into any of the three dimensions of the scale, labeled the pieces of information according to dimension, characteristic, and score, and wrote down ideas about the labels and their relationships. The ideas were then systematized into a coherent set of explanations. The researcher and the second coder discussed themes, grouped related themes, and renamed the combined categories.

Results

RQ1: What are the features which distinguish examinees' performances for delivery, language use, and topic development?

Spoken Features Distinguishing Examinees' Delivery Performances

The quantitative data analysis shows that speech rate and content planning pauses significantly predicted the examinees' delivery performance for Task 3, accounting for 76% of the total variance [R2 = .76, F(2, 55) = 87.45]. The positive relationships between these features (speech rate, b = .70; content planning pauses, b = .30) and the delivery scores indicate that the faster the speech rate and the more content planning pauses used in spoken responses, the higher the examinees' delivery performance levels. For Task 4, speech rate alone accounted for approximately 72% of the total variance in delivery scores [R2 = .72, F(1, 59) = 152.89] with a very strong influence (b = .85), indicating that the more syllables produced by the examinees, the higher their delivery performance levels.

In combination with the qualitative analyses, Table 2 summarizes the features that were likely to distinguish examinees between adjacent score levels. For example, between score levels 1 and 1.5, slow speech rate and a small number of content planning pauses were used to distinguish among the speakers. The delivery features that were also considered included pronunciation at word and phrasal levels and short responses (as predictors for level 1), as well as fluidity, hesitations, clarity of speech, intonation, the speaker's attempt to respond, and listener effort (as predictors for level 1.5). Between scores 3 and 3.5, the speaker's confidence in delivering speech, pauses for word search or content planning, and native-like pace and pauses may distinguish spoken responses. Native-like or natural delivery in general could be the feature that potentially distinguishes spoken responses at a score of 3.5 from a score of 4.

Spoken Features Distinguishing Examinees' Language Use Performances

For language use, the quantitative analyses revealed that type/token ratio, low/high frequency, relative clauses, and stance adverbs were the best predictors for Task 3, accounting for 58% of the total variance in examinees' language use scores [R2 = .58, F(1, 53) = 17.97]. The positive relationships between these features (type/token ratio, b = .40; low/high frequency, b = .33; relative clauses, b = .30; and stance adverbs, b = .21) and the examinees' language use performance signified that the more these language features were found in the speech, the higher the examinees' scores in language use. The analysis for Task 4 shows that the significant predictors included prepositional phrases, error-free c-units,
relative clauses, and passives, accounting for 54% of the total variance in examinees' language use scores [R2 = .54, F(1, 56) = 16.59]. The positive relationships between the language use scores and these features (prepositional phrases, b = .29; error-free c-units, b = .37; relative clauses, b = .37; and passives, b = .24) indicate that the more these features were produced by the speakers, the higher the language use scores they received.

Together with the quantitative data, features drawn from the raters' verbal reports were likely to distinguish between adjacent score levels (Table 3). For example, the features separating examinees at 1.5 and 2 included the speakers' attempts to use complex structures, albeit without success, and the speakers' lack of confidence in producing language. An automatic use of advanced vocabulary and a wide range of complex structures were characteristics that distinguished the speech samples that received scores of 3.5 and 4.

Table 2. Salient Features for Delivery at Adjacent Score Levels

Score 1
  Quantitative: slower speech rate; a smaller number of content planning pauses
  Qualitative: pronunciation at word and phrasal levels; short length of speech; unintelligible speech
Score 1.5
  Qualitative: not fluent; lots of hesitations; unclear speech; problematic intonation; speaker's attempt to respond to the task; a great deal of listener effort
Score 2
  Qualitative: choppy pace; L1 influence; listener effort; repetition of words/phrases and use of false starts; unclear pronunciation; monotone speech
Score 2.5
  Qualitative: effects of use of fillers (distracting and challenging the listener)
Score 3
  Qualitative: speaker's confidence in delivering speech; pauses for word search
Score 3.5
  Qualitative: nativeness or naturalness of intonation, pace, and pauses
Score 4
  Quantitative: faster speech rate; a larger number of content planning pauses
  Qualitative: nativeness or naturalness of overall delivery; pauses for information recall


Table 3. Salient Features for Language Use at Adjacent Score Levels

Score 1
  Quantitative: a smaller type/token ratio; a smaller proportion of low/high-frequency words; a smaller number of error-free c-units; less use of stance adverbs, prepositional phrases, relative clauses, and passives
  Qualitative: length of speech
Score 1.5
  Qualitative: repetitions of words/phrases (including key words from the prompt)
Score 2
  Qualitative: speaker's attempt to use complex structures, but not successful; speaker's lack of confidence in producing language
Score 2.5
  Qualitative: use of repairs or self-corrections; nominal substitution
Score 3
  Qualitative: ability to use developed complex structures; repetitions of sophisticated words
Score 3.5
  Qualitative: minor errors of vocabulary and grammar use
Score 4
  Quantitative: a larger type/token ratio; a larger proportion of low/high-frequency words; a larger number of error-free c-units; more use of stance adverbs, prepositional phrases, relative clauses, and passives
  Qualitative: an automatic use of advanced vocabulary; a wide range of complex structures

Spoken Features Distinguishing Examinees' Topic Development Performances

The quantitative analysis shows that for Task 3 linking adverbials were a significant predictor, accounting for 7% of the total variance in topic development scores [R2 = .07, F(1, 56) = 4.06]. The positive relationship between this feature and the topic development scores (b = .26) indicated that the more linking adverbials were produced, the higher the scores the examinees received for topic development. For Task 4, multiple regression could not be
computed because neither linking adverbials nor reference devices were significantly correlated with the topic development scores. The qualitative analysis exhibits various features that may have influenced the raters while scoring topic development. In combination with the quantitative results, Table 4 summarizes the features that were likely to distinguish examinees' performance on topic development. The features distinguishing between scores 1 and 1.5, for example, were a serious lack of relevant information required by the task and poor connection of ideas and text coherence. Inclusion of all key points required by the task and good synthesis of the prompt were likely to separate speakers at score level 3.5 from those at score level 4.

Table 4. Salient Features for Topic Development at Adjacent Score Levels

Score 1
  Quantitative: (smaller number of linking adverbials)
  Qualitative: serious lack of relevant information required by the task
Score 1.5
  Qualitative: poor connection of ideas and text coherence
Score 2
  Qualitative: inaccuracy and vagueness of information; speaker's comprehension of the stimulus or task
Score 2.5
  Qualitative: difficulty in developing speech and connecting ideas at times
Score 3
  Qualitative: some accuracy of ideas
Score 3.5
  Qualitative: inclusion of introduction and conclusion
Score 4
  Quantitative: (larger number of linking adverbials)
  Qualitative: all key points required by the task; good synthesis of prompt

RQ2: What are the relationships between dimension scoring profiles and the ETS holistic scores?

Relationships between dimension scoring profiles (using the expanded scores) and the ETS holistic scores (i.e., 1, 2, 3, and 4) were examined to (1) find the different scoring profiles used by the novice raters in scoring the 119 spoken responses that received particular holistic scores, and (2) examine which of these profiles contributed most to each holistic score level. By listing the dimension consensus scores of the responses classified into each ETS holistic score level, different profiles were drawn to illustrate how the consensus scores for individual dimensions of particular responses were identical to, greater than, or less than the holistic score. To report the profiles, the frequency and percentage of the scoring patterns were used to reveal which one(s) among the emerging profiles most accounted for each ETS holistic score.


Before looking at the results, the scoring guidelines for the TOEFL iBT's integrated spoken responses are restated here to help the reader understand the results to be reported. The scoring guidelines (Educational Testing Service, 2008) explain that a rater must consider all three performance dimensions. To receive a score of 4 (the highest level), a response must meet all three performance dimensions at that level. For example, a response that receives a score of 3 for delivery, 4 for language use, and 4 for topic development would receive a score of 3, not 4. To receive a score of 3, 2, or 1, a response must meet a minimum of two of the performance criteria at that level. For example, a response receiving a score of 3 for delivery, 2 for language use, and 2 for topic development would receive a score of 2. In the results for this research question, not only were different scoring profiles drawn, but the scoring profiles the raters gave were also compared with the expected scoring patterns addressed in the ETS scoring guidelines. The results report both expected scoring profiles (i.e., those corresponding to the ETS scoring guidelines) and unexpected profiles (i.e., those not corresponding to the guidelines).

Dimension Scoring Profiles for ETS Holistic Score 4

For the 20 spoken responses that received an ETS holistic score of 4, the raters' scoring, when using the expanded scale, can be categorized into five scoring profiles (Table 5). Among these, profile A was the expected profile, showing that the raters assigned a score of 4 on all three performance dimensions to six responses (30%). (The symbol = 4 represents correspondence of a score of 4 from the expanded scale to the ETS scoring guidelines, which state that a response receiving a score of 4 must meet all three performance criteria at score level 4.) The other four profiles were unexpected profiles, as they did not correspond to the ETS guidelines. For example, profile B showed that the raters judged two of the performance dimensions (i.e., delivery and language use) to be equal to 4 (= 4) while the topic development performance was judged to be less than 4 (< 4). This pattern accounted for 20% of all responses.

Table 5. Dimension Scoring Profiles for Holistic Score 4

Profile  Delivery  Language Use  Topic Development  Frequency (N = 20)  % (100)
A        =4        =4            =4                 6                   30
B        =4        =4            <4                 4                   20
C        =4        <4            <4                 2                   10
D        <4        <4            =4                 2                   10
E        <4        <4            <4                 6                   30
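As a compact restatement of the guideline above and of the profile notation used in Tables 5 through 8, the sketch below derives the holistic score implied by three whole-number dimension scores and labels a rater's expanded dimension scores relative to the ETS holistic score. The function names are hypothetical, the reading of "meets the criterion at that level" as a dimension score at or above that level is an assumption, and this is not ETS's operational scoring procedure.

```python
# Sketch of the scoring rule restated above and of the profile notation used in
# Tables 5-8. Function names are hypothetical; "meets the criterion at a level"
# is read here as a dimension score at or above that level, which reproduces the
# two worked examples in the guidelines. This is not ETS's operational procedure.
def expected_holistic(delivery: int, language_use: int, topic_dev: int) -> int:
    dims = (delivery, language_use, topic_dev)
    if all(d == 4 for d in dims):            # a 4 requires all three dimensions at 4
        return 4
    for level in (3, 2, 1):                  # otherwise, at least two dimensions at the level
        if sum(d >= level for d in dims) >= 2:
            return level
    return 0

def profile(dims: tuple[float, float, float], holistic: int) -> str:
    """Label each dimension relative to the holistic score, e.g. '=4 =4 <4' (profile B)."""
    rel = lambda d: ("=" if d == holistic else "<" if d < holistic else ">") + str(holistic)
    return " ".join(rel(d) for d in dims)

# The two examples from the guidelines:
assert expected_holistic(3, 4, 4) == 3
assert expected_holistic(3, 2, 2) == 2
# A rater's expanded scores (3.5, 4, 3.5) for a response holistically scored 4:
# profile((3.5, 4, 3.5), 4)  ->  '<4 =4 <4'
```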

Dimension Scoring Profiles for ETS Holistic Score 3

Scores assigned by the novice raters for the 39 spoken responses that received an ETS holistic score of 3 patterned into fifteen scoring profiles (Table 6). Among these, profiles A to F were the expected scoring profiles, since the ETS guidelines state that a response at score 3 must meet a minimum of two of the performance criteria at that level (i.e., at least two = 3 symbols in each profile). These six scoring profiles showed raters' scores of 3 for at least two performance dimensions, accounting for nearly 26% of the responses at holistic score level 3 (i.e., 10 responses out of 39). Profiles G to I were also considered expected profiles because, as addressed in the ETS guidelines for scoring a 4, the raters did not assign a score of 4 for all three dimensions,
which broke the rule for score 4. Instead, the raters scored the responses a 4 for two dimensions and a 3 for one dimension; thus, these responses received a score of 3, not 4. About 28% of the responses at holistic score level 3 (i.e., 11 responses out of 39) can be explained by these three scoring profiles combined. The remaining profiles (46%), profiles J to O, were considered unexpected, as they did not follow the ETS guidelines.

Table 6. Dimension Scoring Profiles for Holistic Score 3

Profile  Delivery  Language Use  Topic Development  Frequency (N = 39)  % (100)
A        =3        =3            =3                 2                   5.13
B        =3        =3            <3                 2                   5.13
C        =3        =3            >3                 2                   5.13
D        <3        =3            =3                 2                   5.13
E        =3        >3            =3                 1                   2.56
F        >3        =3            =3                 1                   2.56
G        >3        =3            >3                 5                   12.82
H        >3        >3            =3                 4                   10.26
I        =3        >3            >3                 2                   5.13
J        =3        >3            <3                 2                   5.13
K        =3        <3            <3                 2                   5.13
L        <3        =3            <3                 2                   5.13
M        >3        =3            <3                 2                   5.13
N        >3        >3            >3                 9                   23.07
O        >3        <3            >3                 1                   2.56

Dimension Scoring Profiles for ETS Holistic Score 2

There were thirteen dimension scoring profiles for the 40 spoken responses that received an ETS holistic score of 2 (Table 7).

Table 7. Dimension Scoring Profiles for Holistic Score 2

Profile  Delivery  Language Use  Topic Development  Frequency (N = 40)  % (100)
A        =2        =2            =2                 2                   5
B        =2        >2            =2                 3                   7.5
C        >2        =2            =2                 2                   5
D        =2        =2            >2                 1                   2.5
E        =2        =2            <2                 4                   10
F        =2        <2            =2                 1                   2.5
G        >2        >2            =2                 6                   15
H        >2        =2            >2                 2                   5
I        <2        =2            <2                 2                   5
J        =2        <2            <2                 1                   2.5
K        >2        >2            >2                 12                  30
L        >2        >2            <2                 3                   7.5
M        <2        <2            <2                 1                   2.5


Six of these profiles (32.5%), profiles A to F, were expected scoring profiles because they corresponded to the ETS scoring guidelines for score level 2: in these six profiles, the raters assigned a score of 2 for at least two dimensions. Profiles G to M were unexpected at this score level, as they did not correspond to the ETS guidelines; none of them showed the raters giving a score of 2 for at least two dimensions. The unexpected profiles accounted for 67.5% of the total responses at this score level, more than the expected scoring profiles.

Dimension Scoring Profiles for ETS Holistic Score 1

Five scoring profiles were found at the lowest score level (Table 8). Among these, profiles A, B, and C were congruent with the ETS guidelines for score level 1, accounting for 25% of the total responses at this holistic score level. They were similar to each other in that the responses in these profiles were assigned a score of 1 for at least two performance dimensions. The other two profiles, D and E, did not correspond to the ETS guidelines and were thus considered unexpected profiles. In both, the raters judged performance in at least two dimensions to be better than a score of 1. Profile D accounted for 30% and profile E for 45% of the responses at holistic score 1.

Table 8. Dimension Scoring Profiles for Holistic Score 1

Profile  Delivery  Language Use  Topic Development  Frequency (N = 20)  % (100)
A        =1        =1            =1                 3                   15
B        =1        =1            <1                 1                   5
C        =1        >1            =1                 1                   5
D        >1        >1            =1                 6                   30
E        >1        >1            >1                 9                   45

RQ3: Are the expanded score levels functioning appropriately for delivery, language use, and topic development?

This section reports results from the inter-rater reliability analysis of the double ratings for holistic and dimension scores, followed by results from the FACETS analysis. As shown in Table 9, the inter-rater reliability analysis showed acceptable criterion reliability values (i.e., ranging from r = 0.69 to r = 0.88) (Brown & Hudson, 2002). Across both tasks, holistic scoring showed the highest inter-rater reliability (r = 0.81), followed by delivery and language use (r = 0.78) and topic development (r = 0.75).

Table 9. Inter-rater Reliability by Task Type and Score Dimension

Score Dimension      Task 3 (N = 58)  Task 4 (N = 61)  Both Tasks (N = 119)
Overall/Holistic     0.88*            0.74*            0.81*
Delivery             0.81*            0.74*            0.78*
Language Use         0.80*            0.75*            0.78*
Topic Development    0.81*            0.69*            0.75*
* p < 0.05
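As a simple illustration of what these coefficients represent, the sketch below computes a Pearson correlation between two raters' scores for the same responses. The data and variable names are hypothetical, and the study's actual reliability procedure, including its handling of discrepant ratings, is the one described above.

```python
# Illustrative inter-rater reliability: Pearson r between two raters' scores for
# the same responses. The data and names here are hypothetical.
from scipy.stats import pearsonr

rater_a = [2.0, 2.5, 3.0, 1.5, 4.0, 3.5, 2.0, 3.0]   # expanded-scale scores, rater A
rater_b = [2.0, 3.0, 3.0, 1.5, 3.5, 3.5, 2.5, 3.0]   # same responses, rater B

r, p = pearsonr(rater_a, rater_b)
print(f"inter-rater r = {r:.2f} (p = {p:.3f})")
```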


The evaluation of the functioning of the expanded score levels for individual dimensions in the FACETS analysis was based on (1) average measures, (2) Most Probable from thresholds, and (3) mean-square outfit statistics.

Functioning of the Expanded Score Levels for Delivery

The FACETS results show that the average measure values were ordered from the lowest (-4.04) to the highest (5.25) as the score levels increased (Table 10). All of the Most Probable from thresholds for the rating scale also increased as the score levels increased. These patterns indicate that the score levels for delivery were ordered appropriately and were distinguishable. Most of the mean-square outfit statistics were within the acceptable range of 0.5 to 1.5, except for score level 4, with an outfit mean-square value of 1.6, suggesting that there could be an unexpected component in those ratings. Otherwise, each score level contributed to meaningful measurement, as the levels were intended to distinguish degrees of performance.

Table 10. Score Level Statistics for Delivery

Score Level  Avge Measure  Most Probable From  OUTFIT MnSq
0            -4.04         Low                 0.7
1            -2.67         -5.60               1.2
1.5          -2.18         -2.73               0.7
2            -0.72         -1.81               0.5
2.5          0.93          -0.28               0.9
3            2.79          1.65                0.8
3.5          4.38          3.65                0.8
4            5.25          5.12                1.6
Note. Low shows the most probable score level at the low end of the scale. Score level 0 was present in the data but not included in the report due to its very small number of occurrences (i.e., only one zero found in delivery).

Functioning of the Expanded Score Levels for Language Use

Table 11 shows that higher Average Measures corresponded to higher score levels for the language use dimension.

Table 11. Score Level Statistics for Language Use

Score Level  Avge Measure  Most Probable From  OUTFIT MnSq
0            -4.32         Low                 0.80
1            -3.44         -5.16               0.80
1.5          -2.51         -3.63               1.20
2            -0.60         -2.14               0.80
2.5          0.80          0.21                0.80
3            2.82          1.39                1.10
3.5          4.22          3.92                0.90
4            5.45          5.41                0.90
Note. Low shows the most probable score level at the low end of the scale. Score level 0 was present in the data but not included in the report due to its very small number of occurrences (i.e., only three zeros found in language use).
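The three criteria applied here (average measures and thresholds that rise with the score levels, and outfit mean-squares within 0.5 to 1.5) can be checked mechanically. The snippet below is a small illustration using the Table 10 values for delivery; it only verifies the published statistics and is not a substitute for the FACETS (many-facet Rasch) analysis itself.

```python
# Check category functioning against the criteria used above, applied here to the
# Table 10 (delivery) values; None marks the threshold reported as "Low".
levels     = [0, 1, 1.5, 2, 2.5, 3, 3.5, 4]
avg_meas   = [-4.04, -2.67, -2.18, -0.72, 0.93, 2.79, 4.38, 5.25]
thresholds = [None, -5.60, -2.73, -1.81, -0.28, 1.65, 3.65, 5.12]
outfit     = [0.7, 1.2, 0.7, 0.5, 0.9, 0.8, 0.8, 1.6]

measures_ordered   = all(a < b for a, b in zip(avg_meas, avg_meas[1:]))
thresholds_ordered = all(a < b for a, b in zip(thresholds[1:], thresholds[2:]))
misfitting         = [lvl for lvl, o in zip(levels, outfit) if not 0.5 <= o <= 1.5]

print(measures_ordered, thresholds_ordered, misfitting)   # True True [4]
```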


For language use, all of the Most Probable from thresholds values also show correspondence between the higher values and the higher score levels. These patterns suggest that the score levels were functioning appropriately to differentiate degrees of performance. All of the mean-square outfit statistics fell within the acceptable range of 0.50 to 1.50, with the lowest value at 0.80 and the highest at 1.20. This evidence indicates that the score levels were used meaningfully to measure speaking performance, as they were expected to be.

Functioning of the Expanded Score Levels for Topic Development

Similar to the evidence found for the delivery and language use dimensions, the score levels tended to function properly for topic development (Table 12). This can be seen in the fact that the Average Measures and the Most Probable from thresholds were ordered from the lowest to the highest as the score levels increased. In addition, the mean-square outfit statistics show that all of the values (ranging from 0.80 to 1.30) were in the acceptable range of 0.50 to 1.50. This suggests that the score levels were meaningfully contributing to measurement in topic development.

Table 12. Score Level Statistics for Topic Development

Score Level  Avge Measure  Most Probable From  OUTFIT MnSq
0            -4.79         Low                 0.80
1            -2.76         -5.34               1.30
1.5          -1.42         -2.12               1.00
2            -0.08         -0.90               0.80
2.5          0.64          0.66                0.90
3            2.12          0.86                0.80
3.5          3.28          3.30                0.80
4            3.89          3.55                1.10
Note. Low shows the most probable score level at the low end of the scale. Score level 0 was present in the data but not included in the report due to its very small number of occurrences (i.e., only six zeros found in topic development).

Discussion and Conclusion

The findings are discussed in four sections: (1) dimension scoring profiles in comparison to the ETS holistic scores, (2) functioning of the expanded score levels, (3) spoken features that distinguished test takers' performances in the delivery, language use, and topic development dimensions, and (4) how the evidence from these topics leads to an expansion of the TOEFL iBT's integrated speaking scale and the development of a rating guide for the three dimensions.

Dimension Scoring Profiles and the ETS Holistic Scores

The analysis of the scoring profiles emerging from the data in each dimension shows interesting relationships between dimension scoring profiles and the ETS holistic scores, and the extent to which these profiles contributed to each holistic score level. Using the ETS scoring guidelines as criteria for the scoring profiles, different scoring profiles were found for each holistic score level. Among these profiles, the expected profiles for all holistic score levels,
except the holistic score 3, accounted for a smaller percentage of the total responses than the unexpected profiles. This indicates that when raters attended to individual dimensions separately, they concentrated on each dimension freely; in other words, their scores were not masked by the other dimensions, as they can be while assigning holistic scores (Brown, Iwashita, & McNamara, 2005). Although there was a small number of expected scoring profiles, dimension scoring demonstrates the possibility for raters to score the TOEFL iBT's spoken responses both analytically and holistically. Holistic scoring (as currently used with the TOEFL iBT's speaking scale) is appropriate in many situations where there are a large number of test takers and the test scores are used for admissions into higher education institutions. On the other hand, analytic scoring should be considered in low-stakes situations which focus on, for example, diagnosis or test takers' improvement and feedback (e.g., in classroom settings). In this regard, the expanded scale with analytic scoring for each dimension could allow novice raters or teachers to use the scale more easily and effectively in terms of indicating a student's language progress.

The differences between the dimension scores and the ETS holistic scores raise an important issue regarding scoring reliability. The score differences support the development of rating guides for all three dimensions to help improve inter-rater reliability. Once inter-rater reliability is improved, test users can be more certain that the scores are used appropriately and as intended. The clear and appropriate use of test scores then contributes to the utilization of scores (e.g., how the scores inform what should be done in English curricula or language programs), which is one of the important components in building the TOEFL iBT validity argument.

Functioning of the Expanded Score Levels for the Three Dimensions

As the purpose of this study was to expand the TOEFL iBT's integrated speaking rating scale from four score levels (i.e., 1, 2, 3, and 4) to seven levels (i.e., 1, 1.5, 2, 2.5, 3, 3.5, and 4), it was necessary to explore whether the idea of using the seven levels in the expanded scale is plausible. The FACETS analysis provided evidence that generally confirmed appropriate functioning of the seven score levels for delivery, language use, and topic development. The seven expanded score levels have the potential to distinguish test takers' performance in the three dimensions.

The appropriate functioning of the expanded scores, in combination with high inter-rater reliability in each dimension, indicated that the novice raters in this study were able to use the expanded scores consistently and to distinguish the score levels clearly. This not only supports the expansion of the speaking scale, but also signifies the potential for language teachers who are unfamiliar with scoring students' speaking performance, and for newly hired raters, to be trained to use the seven expanded score levels.

Spoken Features Distinguishing Test Takers' Performances

In reconciling the findings of the acoustic, linguistic, and verbal protocol analyses, it was found that a combination of features could predict examinees' speaking performance on delivery, language use, and topic development across score levels and tasks.
The following sections discuss the salient features in each dimension, followed by how these features were used to inform the expansion of the TOEFL iBT's integrated speaking rating scale through the development of new rating guides for delivery, language use, and topic development.


Spoken Features for Delivery

Speech rate was found to be a significant predictor for Tasks 3 and 4. Content planning pauses were a significant predictor in Task 3. As a significant predictor, speech rate supports evidence in previous studies on fluency in spoken language showing that speech rate is one of the best variables for distinguishing between fluent and disfluent speakers (e.g., Derwing & Munro, 1997; Ejzenberg, 2000; Freed, 2000; Kang, 2008; Kormos & Dénes, 2004). The evidence for content planning pauses supports Fulcher's findings from the validation of a rating scale of oral language fluency (Fulcher, 1993, 1996a). The positive relation between these features and the delivery scores indicated that speakers with higher scores tended to speak faster and to use more pauses to plan their content than those with lower scores.

Insights obtained from the raters' verbal protocols were also considered. The salient spoken features from the qualitative data reflect the features that the raters most often attended to at each score level. For delivery, at the lower end (i.e., scores 1 to 2), the raters paid attention to their difficulty in understanding the spoken responses. This difficulty was usually caused by the speakers' inability to sustain speech, poor pronunciation, limited intonation, very choppy pace, many hesitations, and L1 influence. At the higher end (i.e., scores 2.5 to 4), the raters had a better understanding of the responses but still mentioned some speaker problems, including the use of repeated words, false starts, and fillers. The raters also turned their attention to the speakers' ability to deliver the responses naturally.

As the goal of the study was to expand the TOEFL iBT's integrated speaking scale through the development of new rating guides, the salient features derived from both the acoustic and protocol analyses that were in line with the TOEFL iBT descriptors were considered for inclusion in the expanded scale and rating guide. It is valuable when a feature drawn from one analysis is found to be consistent with a feature from another analysis. In the delivery dimension, for example, content planning pauses (from the acoustic analysis) and pauses to recall information or content of the task (from the raters' protocol analysis) were consistent. This feature was then included in the rating guide to distinguish speakers' delivery performance. The following explanation focuses on how the salient delivery features were used in the development of a rating guide for delivery.

A Proposed Rating Guide for Delivery

Based on the quantitative and qualitative results of the study as well as evidence of the appropriate functioning of the expanded score levels in the delivery dimension, a new rating guide using seven binary questions in three levels is proposed (Figure 1). The development of the rating guide for delivery followed five steps, as suggested by Upshur and Turner (1995). First, the salient delivery features, from both acoustic measures and verbal protocols, for responses at score levels 2 and 2.5 were divided into the "better" and the "poorer." In this case, speech rate and the raters' understanding of the speech distinguished the better from the poorer. Second, a criterion question was formulated to classify performances as upper-half or lower-half. This question was used as the first question in the rating guide (question 1), asking if the speaker produces understandable speech at a natural speed. Third, binary questions were developed, beginning with the upper-half.
Working with the upper-half salient features, the features found in score levels 4 and 3.5, but not in 3 and 2.5, were selected. Here, clarity of speech and varied intonation were outstanding at the higher levels, whereas the opposite qualities of speech were present at the lower ones. Thus, the criterion question asks if the speaker has clear pronunciation and varied intonation, distinguishing the speakers at
4 and 3.5 from those at 3 and 2.5 (question 2a). Fourth, at the third level of the rating guide, two criterion questions were formulated to discriminate score level 4 from 3.5 and score level 3 from 2.5. Using the feature that is dominant in score 4 but not in 3.5, the question is "Does the speaker use pauses naturally (to recall information or plan content) before continuing his/her speech?" (question 3a). A similar method was applied to distinguish the speakers at levels 3 and 2.5. The salient feature found in score 3 but not in 2.5 was used to generate a criterion question for these scores (question 3b). Because the speaker's use of fillers, repetition of words, and false starts distinguished score 3 from score 2.5, the question was posed as "Does the speaker show his/her hesitancy by occasionally using fillers but with few repetitions and false starts?" Fifth, steps 3 and 4 were repeated for the lower-half performances. This time the lowest score level of the TOEFL scale (i.e., 0) was included to complete the scale. Question 2b was formulated using the salient feature that was found in score levels 2 and 1.5, but rare in 1 and 0. Thus, the question asks if only some listener effort is required. Then criterion questions were formulated to distinguish score level 2 from 1.5 and score level 1 from 0. Question 3c was based on the dominant feature in score 2 (i.e., problematic pronunciation with only a little L1 influence) but not in 1.5. Question 3d addressed the feature that made score 1 more salient than score 0 (i.e., the speaker's attempt to respond with partial intelligibility).

Delivery

Level 1
1) Does the speaker produce understandable speech at a natural speed?
   Yes: go to 2a (upper-half). No: go to 2b (lower-half).

Level 2
2a) Does the speaker have clear pronunciation and varied intonation?
   Yes: go to 3a. No: go to 3b.
2b) Is only some listener effort required?
   Yes: go to 3c. No: go to 3d.

Level 3
3a) Does the speaker use pauses naturally (to recall information or plan content) before continuing his/her speech?
   Yes: 4. No: 3.5.
3b) Does the speaker show his/her hesitancy by occasionally using fillers but with few repetitions or false starts?
   Yes: 3. No: 2.5.
3c) Does the speaker have problematic pronunciation but with only a little L1 influence?
   Yes: 2. No: 1.5.
3d) Is the speaker's attempt to respond partially intelligible?
   Yes: 1. No: 0.

Figure 1. A proposed rating guide for delivery.
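Because the guide is a fixed sequence of binary choices, it can also be expressed as a small decision function. The sketch below mirrors Figure 1, with each argument standing for the rater's yes/no answer to the corresponding question; the function name and interface are illustrative only, and the same pattern applies to the language use and topic development guides proposed later.

```python
# Sketch of Figure 1 as a decision function. Each argument is the rater's yes/no
# answer to the corresponding question; the name and interface are illustrative.
def delivery_score(q1_natural_speed: bool,
                   q2a_clear_pron_varied_intonation: bool = False,
                   q2b_only_some_listener_effort: bool = False,
                   q3a_natural_pauses: bool = False,
                   q3b_occasional_fillers_few_repetitions: bool = False,
                   q3c_problematic_pron_little_l1: bool = False,
                   q3d_attempt_partially_intelligible: bool = False) -> float:
    if q1_natural_speed:                                   # upper-half
        if q2a_clear_pron_varied_intonation:
            return 4 if q3a_natural_pauses else 3.5
        return 3 if q3b_occasional_fillers_few_repetitions else 2.5
    if q2b_only_some_listener_effort:                      # lower-half
        return 2 if q3c_problematic_pron_little_l1 else 1.5
    return 1 if q3d_attempt_partially_intelligible else 0

# Example: understandable speech at a natural speed with clear pronunciation and
# varied intonation, but pauses not yet native-like -> 3.5
# delivery_score(True, q2a_clear_pron_varied_intonation=True, q3a_natural_pauses=False)
```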


Spoken Features for Language Use

The analyses of theoretically motivated and data-driven linguistic features for Tasks 3 and 4 yielded different predictors of the examinees' oral performance in language use. For Task 3, type/token ratio, proportion of low- and high-frequency words, relative clauses, and stance adverbs were significant predictors. Error-free c-units, prepositional phrases, relative clauses, and passives significantly predicted language use performance for Task 4. These results, similar to those of Brown, Iwashita, and McNamara (2005), indicate that task differences may affect the language used by the speakers. In particular, different prompts or topics may result in the use of different linguistic features. Despite the fact that speakers used different linguistic features in different tasks, it is reasonable to combine the predictors from both tasks as one set of predictors for language use performance, since the goal of the study was to develop one rating guide for the language use dimension.

Among the seven predictors of language use performance, three are word-level measures, another three represent features at the phrasal and clausal levels, and one is related to grammatical accuracy. Type/token ratio and the proportion of low- and high-frequency words are related to the use of words in spoken responses, whereas stance adverbs reflect particular lexical features of linguistic complexity. Prepositional phrases, relative clauses, and passives represent the use of complex structures. Error-free c-units address accuracy of language use.

The finding for type/token ratio contradicted the evidence found in a study by Daller, Van Hout, and Treffers-Daller (2003). The result found in this study showed that the type/token ratio was larger for higher-level examinees than for lower-level examinees; conversely, Daller, Van Hout, and Treffers-Daller found significantly larger ratios for lower-level speakers than for higher-level speakers. Despite these different results, it has been suggested that type/token ratio can be used to measure the lexical richness of texts (Vermeer, 2000). Another predictor related to vocabulary use, the proportion of low- and high-frequency words, was also examined in Vermeer's (2000) study of lexical richness in spontaneous speech. Vermeer's findings suggested that analysis of different levels of lexical frequency, distinguishing basic (or high-frequency) and advanced (or low-frequency) lexical items used by speakers, can be utilized to measure the lexical richness of speakers at different ability levels (Laufer & Nation, 1995). As for particular lexical features, stance adverbs represent the data-driven lexico-grammatical structures for linguistic complexity in this study. The use of stance adverbs signifies that the lexico-grammatical analysis allows the researcher to look at variability of language use at different levels, especially where language learners are likely to produce short stretches of language at the word or phrasal level (Poonpon, 2007; Rimmer, 2006).

The analysis of grammatical complexity motivated by the literature revealed prepositional phrases and relative clauses, which represent high-level complexity, as significant predictors. Usually these two structures occur more frequently in written language than in spoken language (Biber, Johansson, Leech, Conrad, & Finegan, 1999), and grammatical features used in written language are likely to be more complex (Bygate, 2002; Norrby & Håkansson, 2007).
In particular, Biber, Gray, and Poonpon (2009) have argued that prepositional phrases and relative clauses function as constituents in noun phrases. Occurrences of these two structures thus signify complex noun phrases (i.e., non-clausal features embedded in noun phrases); these embedded phrasal features represent "a considerably higher degree of production complexity" (Biber, Gray, & Poonpon, 2009, p. 22). When the speakers used more complex structures that are common in written language,
this showed their ability to construct higher-level complex grammatical features (Biber, Gray, & Poonpon, 2009). In other words, higher-level speakers are likely to use more complex structures in their oral language than lower-level speakers (Lennon, 1990). The finding of the study supports this theoretical position because the use of prepositional phrases and relative clauses was found to distinguish between examinees with high and low proficiency. It also promotes subordinate clauses (Biber, Gray, & Poonpon, 2009; Foster & Skehan, 1996; Norrby & Håkansson, 2007), particularly dependent clauses functioning as constituents in a noun phrase, and phrasal dependent structures functioning as constituents in noun phrases (Biber, Gray, & Poonpon, 2009; Rimmer, 2008), as plausible measures of grammatical complexity for oral discourse. These complexity measures allow researchers to capture complex structures at non-clausal levels.

The finding for error-free c-units agrees with studies on the scoring of the TOEFL iBT speaking test (Brown, Iwashita, & McNamara, 2005; Iwashita, Brown, McNamara, & O'Hagan, 2007). These studies used error-free c-units to measure the global accuracy of test takers' language use. They recommended this global measure for measuring grammatical accuracy because error-free c-units were found to significantly predict test takers' ability to use oral language.

In addition to the findings from the linguistic analysis, the data from the raters' protocols were considered. At the lower end (i.e., scores 1 to 2), the raters attended to the speakers' lack of vocabulary and grammar, leading to the use of basic vocabulary and structures. They also mentioned a large number of grammatical errors. These features were mentioned less at the higher scores (i.e., scores 2.5 and 3). At scores 3.5 and 4, the raters clearly addressed more accurate use of vocabulary and grammar as well as a wider range and control of advanced vocabulary and complex structures. Some of the salient features corresponded to the significant predictors found in the prior analysis of grammatical complexity for language use. Different complex structures were mentioned at different score levels (i.e., relative clauses and prepositional phrases at level 2, that-complement clauses and if-then constructions at level 2.5, passive voice at level 3, adverbial clauses at level 3.5, and adverbs, that-complement clauses, noun clauses, and relative clauses at level 4). The raters' attention to the use of relative clauses by the examinees agreed with the finding from the quantitative data. Another salient feature mentioned by the raters that concurred with the results of the linguistic analysis was the use of advanced vocabulary by the speakers at score level 4; this was congruent with the finding for the proportion of low- and high-frequency words. The raters' comments on vocabulary use at the word level at score level 1.5 were well matched with another predictor, lexical verbs, found in the linguistic analysis. These consistent features were valuable for the development of the rating guide because they can distinguish examinees' oral language performance on language use. The insights from the raters' verbal reports and the findings from the linguistic analysis were used in developing the rating guide.

A Proposed Rating Guide for Language Use

A rating guide for language use (Figure 2) was proposed based on the evidence from the quantitative and qualitative results as well as the appropriate functioning of the expanded score levels for language use.
Following the suggestions for developing a binary-question scale (Upshur & Turner, 1995), the development involved five steps (the same steps as described for delivery). First, the salient language use features from both the linguistic analysis and the verbal
protocols for responses at score levels 2 and 2.5 were divided into the "better" and the "poorer." Second, a criterion question was formulated to classify performances as upper-half or lower-half. Because complex structures and advanced vocabulary could distinguish between the better and the poorer, this question asks, "Does the speaker use complex structures and advanced vocabulary?" (question 1).

Language Use

Level 1
1) Does the speaker use complex structures and advanced vocabulary?
   Yes: go to 2a (upper-half). No: go to 2b (lower-half).

Level 2
2a) Does the speaker use vocabulary and grammar with minor or no errors?
   Yes: go to 3a. No: go to 3b.
2b) Is the speaker able to construct grammar at the sentence level?
   Yes: go to 3c. No: go to 3d.

Level 3
3a) Does the speaker have strong control and a wide range of complex structures and advanced vocabulary?
   Yes: 4. No: 3.5.
3b) Does the speaker have a limited range of complex structures and vocabulary, but it does not interfere with communication?
   Yes: 3. No: 2.5.
3c) Do the speaker's range and control of grammar and vocabulary lead to some accuracy of language use?
   Yes: 2. No: 1.5.
3d) Do the speaker's range and control of grammar and vocabulary help him/her to produce isolated words or short utterances?
   Yes: 1. No: 0.

Figure 2. A proposed rating guide for language use.

Third, binary questions were developed, beginning with the upper-half. The salient features that were present in score levels 4 and 3.5, but not in 3 and 2.5, were selected to distinguish the speakers at these scores. Here, the ability to use vocabulary and grammar with accuracy was prevalent at the higher levels. Thus, the criterion question was posed, "Does the speaker use vocabulary and grammar with minor or no errors?" (question 2a). Moving to the third level of the rating guide, two criterion questions were formulated to discriminate score level 4 from 3.5 and score level 3 from 2.5. Question 3a is based on a feature that is outstanding in score 4 but not in 3.5. Thus, the question asks, "Does the speaker have strong control and a
wide range of complex structures and advanced vocabulary?" Similarly, question 3b used the salient feature found in score 3 but not in 2.5 to generate a criterion question. The question was then posed as "Does the speaker have a limited range of complex structures and vocabulary, but it does not interfere with communication?" Steps 3 and 4 were subsequently repeated for the lower-half performances. The lowest score level of the TOEFL scale (i.e., 0) was included to complete the scale. Question 2b was formulated using the salient feature that was found in score levels 2 and 1.5, but rare in 1 and 0. The question asks if the speaker is able to construct grammar at the sentence level. Two more criterion questions were formulated to distinguish the speakers at score level 2 from 1.5 and score level 1 from 0. Based on the dominant feature in score 2 but not in 1.5, question 3c asks whether the speaker's range and control of grammar and vocabulary show some accuracy of language use. Question 3d addressed the feature that made score 1 more salient than score 0 (i.e., the ability to produce speech at the word level), asking whether or not the speaker's range and control of grammar and vocabulary help the speaker to produce isolated words or short utterances.

Spoken Features for Topic Development

The linguistic analysis of the spoken responses revealed that only linking adverbials were likely to be predictors of the examinees' performance on topic development, in Task 3 but not in Task 4. This finding corresponds to corpus-based facts regarding the common uses of linking adverbials in spoken language (Biber, Johansson, Leech, Conrad, & Finegan, 1999). It also shares similarities with Ejzenberg's (2000) study in that the speakers who received high speaking scores had a higher proportion of cohesive devices used to link and organize their monologic talk. Thus, high-level speakers were better able to use coordinating conjunctions and adverbial conjunctions to provide continuity in their speech. Despite this result, linking adverbials accounted for a small proportion (only 7%) of the total variance in topic development performance. Moreover, reference devices did not show any significant relationship or predictive ability in either task. This evidence indicated that the examinees rarely used cohesive devices to connect ideas in their spoken texts. The scarce use of cohesive devices found in the study might reflect linguists' view that coherence does not reside in the text, but in the meaning of a message conveyed between the text and its listener or reader (e.g., Brown & Yule, 1983; Stoddard, 1991).

Because linking adverbials hardly contributed to the score variance for topic development, the raters' protocols were mostly relied on. The information obtained from the raters showed that the outstanding features at the lower end (i.e., scores 1 to 2) included a lack of relevant information required by the task, undeveloped ideas, poor connection of ideas and text coherence, and inaccuracy and vagueness of information. The salient features at the higher end (i.e., scores 2.5 to 4) involved better qualities of topic development performance in terms of the inclusion of more relevant information, clear connection and organization of ideas, and accuracy of information. A few salient features were not explicitly included in the original TOEFL scale.
These features include speakers comprehension of the stimulus or prompt at score 2, inclusion of introduction and conclusion at score 3.5, and speakers ability to synthesize the prompt at score 4. The raters attention to other features apart from those in the scale signified either their different interpretations of the scale description or distraction by other features relevant to the explicitly stated description. In particular, some raters might interpret a clear progression of ideas as inclusion of introduction and conclusion parts in the spoken

90

K. Poonpon

It also might be unavoidable for some raters to refer to the examinees' ability to understand the prompt when they heard incorrect information in the spoken responses.

A Proposed Rating Guide for Topic Development

The proposed rating guide for topic development (Figure 3) was informed by the evidence from the qualitative data and by the appropriate functioning of the expanded score levels. As with the rating guides for delivery and language use, the binary-question ideas and development steps of Upshur and Turner (1995) were applied to the development of the rating guide as follows.
Topic Development

Level 1
1) Does the speaker produce most key ideas and relevant information required by the task?
   Yes → upper half (go to question 2a)   No → lower half (go to question 2b)

Level 2
2a) Does the speaker show clear connection and progression of ideas with introduction and conclusion?
   Yes → go to question 3a   No → go to question 3b
2b) Does the speaker make some connection of ideas, though it is poor?
   Yes → go to question 3c   No → go to question 3d

Level 3
3a) Is the response fully developed with complete ideas and enough detail?
   Yes → 4   No → 3.5
3b) Does the speaker produce some accurate ideas?
   Yes → 3   No → 2.5
3c) Does the speaker give some detailed information?
   Yes → 2   No → 1.5
3d) Does the speaker produce at least one idea, although it is inaccurate, vague, irrelevant to the prompt, or repetitive?
   Yes → 1   No → 0

(Questions 2a, 3a, and 3b apply to upper-half performances; questions 2b, 3c, and 3d apply to lower-half performances.)

Figure 3. A proposed rating guide for topic development.
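For readers who wish to apply the guide mechanically (for example, for classroom record keeping), the branching logic of Figure 3 can be written out as a short decision procedure. The sketch below is illustrative only and is not part of the original study; the function name, its arguments, and the example call are hypothetical, with each argument standing for one criterion question answered yes (True) or no (False).

# Illustrative sketch (an assumption, not part of the original study) of the
# Figure 3 rating guide for topic development expressed as a binary decision tree.
def rate_topic_development(q1, q2a=False, q2b=False, q3a=False, q3b=False,
                           q3c=False, q3d=False):
    """Map yes/no answers to the criterion questions onto scores 0-4."""
    if q1:                        # upper half: most key ideas and relevant information present
        if q2a:                   # clear connection and progression, with introduction and conclusion
            return 4 if q3a else 3.5   # fully developed with enough detail?
        return 3 if q3b else 2.5       # some accurate ideas?
    if q2b:                       # lower half: some (poor) connection of ideas
        return 2 if q3c else 1.5       # some detailed information?
    return 1 if q3d else 0             # at least one idea, however weak?

# Example: most key ideas present, clear progression with introduction and
# conclusion, but not fully developed -> 3.5
print(rate_topic_development(q1=True, q2a=True, q3a=False))

In classroom use, a teacher would simply answer the questions in the order shown and record the resulting half-point score for the topic development dimension.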


To create the first criterion question, the salient features from the verbal protocols for responses at score levels 2 and 2.5 were selected to classify performances as upper-half or lower-half. The features that could generally distinguish the better from the poorer performances involved the inclusion of key ideas and relevant information in the responses. The first criterion question was therefore formulated as "Does the speaker produce most key ideas and relevant information required by the task?" (question 1).

Subsequently, binary questions were developed, beginning with the upper-half performances. The salient features that were present in score levels 4 and 3.5, but not in 3 and 2.5, were selected to distinguish the speakers at scores 4 and 3.5 from those at scores 3 and 2.5. At scores 3.5 and 4, the outstanding features related to clear connection of ideas, including the use of an introduction and conclusion in the responses. Question 2a therefore asks, "Does the speaker show clear connection and progression of ideas with introduction and conclusion?" At the third level of the rating guide, two criterion questions were formulated to discriminate score level 4 from 3.5 and score level 3 from 2.5. Question 3a was related to the feature that was prominent in score 4 but not in 3.5 and asks, "Is the response fully developed with complete ideas and enough detail?" Question 3b was generated from the salient feature found in score 3 but not in 2.5 and was posed as "Does the speaker produce some accurate ideas?"

For the lower-half performances, question 2b was formulated using the salient feature that was present in score levels 2 and 1.5 but rare in 1 and 0; the question asks whether the speaker is able to produce a response with some connection of ideas despite its poor quality. Two more criterion questions were then formulated to distinguish the speakers at score level 2 from 1.5 and score 1 from 0. Based on the dominant feature in score 2 but not in 1.5, question 3c asks, "Does the speaker give some detailed information?" Question 3d addresses the feature that differentiates the speakers at score 1 from those at score 0; it asks whether or not the speaker produces at least one idea, although it is inaccurate, vague, irrelevant to the prompt, or repetitive.

To conclude, the expanded scale with its rating guides provides evidence for an empirically based approach to scale development for oral language assessment and advances the TOEFL iBT's goal of strengthening the link between the test and test takers' preparation in contexts such as classrooms or school systems. Application of the rating guides for instructional and assessment purposes can become a springboard for future research on the washback of the TOEFL in language instruction worldwide. It also supports one of the TOEFL iBT's assumptions: promoting a positive influence on English language instruction.

Acknowledgements

I would like to express my gratitude to the English Language Institute at the University of Michigan for providing the funding for this study. I am also grateful for the funding and a TOEFL iBT data set provided by the Educational Testing Service (ETS).


References

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Bachman, L. F., & Savignon, S. J. (1986). The evaluation of communicative language proficiency: A critique of the ACTFL oral interview. The Modern Language Journal, 70(3), 380–390.
Biber, D. (2001). Codes for counts in ETS program tag count. Unpublished document.
Biber, D., Gray, B., & Poonpon, K. (2009). Measuring grammatical complexity in L2 writing development: Not so simple? Manuscript submitted for publication.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Essex: Pearson Education Limited.
Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English-for-Academic-Purposes speaking tasks (TOEFL Monograph Series MS-29). Princeton, NJ: Educational Testing Service.
Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge: Cambridge University Press.
Brown, G., & Yule, G. (1983). Discourse analysis. Cambridge: Cambridge University Press.
Butler, F. A., Eignor, D., Jones, S., McNamara, T., & Suomi, B. K. (2000). TOEFL 2000 speaking framework: A working paper (TOEFL Monograph Series MS-20). Princeton, NJ: Educational Testing Service.
Bygate, M. (2002). Speaking. In R. B. Kaplan (Ed.), The Oxford handbook of applied linguistics (pp. 27–38). Oxford: Oxford University Press.
Clark, J. L. D., & Clifford, R. T. (1988). The FSI/ACTFL proficiency scales and testing techniques: Development, current status, and needed research. Studies in Second Language Acquisition, 10(2), 129–147.
Daller, H., Van Hout, R., & Treffers-Daller, J. (2003). Lexical richness in the spontaneous speech of bilinguals. Applied Linguistics, 24(2), 197–222.
Dandonoli, P., & Henning, G. (1990). An investigation of the construct validity of the ACTFL proficiency guidelines and oral interview procedure. Foreign Language Annals, 23, 131–151.
Derwing, T. M., & Munro, M. J. (1997). Accent, intelligibility, and comprehensibility. Studies in Second Language Acquisition, 20(1), 1–16.
Educational Testing Service. (2008). Online scoring network. Princeton, NJ: Author. Retrieved June 3, 2008, from http://learnosn.ets.org/
Ejzenberg, R. (2000). The juggling act of oral proficiency: A psycho-sociolinguistic metaphor. In H. Riggenbach (Ed.), Perspectives on fluency (pp. 287–313). Ann Arbor: The University of Michigan Press.
Foster, P., & Skehan, P. (1996). The influence of planning and task type on second language performance. Studies in Second Language Acquisition, 18(3), 299–323.
Freed, B. F. (2000). Is fluency, like beauty, in the eyes (and ears) of the beholder? In H. Riggenbach (Ed.), Perspectives on fluency (pp. 243–265). Ann Arbor: The University of Michigan Press.
Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. ELT Journal, 41, 287–291.
Fulcher, G. (1993). The construct validation of rating scales for oral tests in English as a foreign language. Unpublished PhD thesis, University of Lancaster, United Kingdom.
Fulcher, G. (1996a). Does thick description lead to smart tests? A data-based approach to rating scale construction. Language Testing, 13(2), 208–238.
Fulcher, G. (1996b). Invalidating validity claims for the ACTFL oral rating scales. System, 24, 163–172.
Fulcher, G. (1997). The testing of speaking in a second language. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education (pp. 75–85). Norwell, MA: Kluwer Academic Publishers.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.
Iwashita, N., Brown, A., McNamara, T., & O'Hagan, S. (2007). Assessed levels of second language speaking proficiency: How distinct? Abstract retrieved January 10, 2008, from http://applij.oxfordjournals.org/cgi/content/abstract/amm017v1
Jamieson, J., Jones, S., Kirsch, I., Mosenthal, P., & Taylor, C. (2000). TOEFL 2000 framework: A working paper (TOEFL Monograph Series MS-16). Princeton, NJ: Educational Testing Service.
Kang, O. (2008). The effect of rater background characteristics on the rating of international teaching assistants' speaking proficiency. Spaan Fellow Working Papers, 6, 181–205.
Kormos, J., & Dénes, M. (2004). Exploring measures and perceptions of fluency in the speech of second language learners. System, 32(2), 145–164.
Laufer, B., & Nation, I. (1995). Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics, 16(3), 307–322.
Lennon, P. (1990). Investigating fluency in EFL: A quantitative approach. Language Learning, 40, 387–412.
Lowe, P., Jr. (1986). Proficiency: Panacea, framework, process? A reply to Kramsch, Schulz, and, particularly, to Bachman and Savignon. The Modern Language Journal, 70(3), 391–397.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Thousand Oaks, CA: Sage.
Norrby, C., & Håkansson, G. (2007). The interaction of complexity and grammatical processability: The case of Swedish as a foreign language. International Review of Applied Linguistics, 45(1), 45–68.
North, B. (2000). Scale for rating language proficiency: Descriptive models, formulation styles, and presentation formats (TOEFL Research Paper). Princeton, NJ: Educational Testing Service.
Poonpon, K. (2007). FACETS and corpus-based analyses of a TOEFL-like speaking test to inform speaking scale revision. Unpublished manuscript.
Rimmer, W. (2006). Measuring grammatical complexity: The Gordian knot. Language Testing, 23(4), 497–519.
Rimmer, W. (2008). Putting grammatical complexity in context. Literacy, 42, 29–35.
Stansfield, C. W., & Kenyon, D. M. (1992). The development and validation of a simulated oral proficiency interview. The Modern Language Journal, 76(2), 130–141.
Stoddard, S. (1991). Text and texture: Patterns of cohesion. New Jersey: Ablex.
Upshur, J. A., & Turner, C. E. (1995). Constructing rating scales for second language tests. ELT Journal, 49, 3–12.
Upshur, J. A., & Turner, C. E. (1999). Systematic effects in the rating of second-language speaking ability: Test method and learner discourse. Language Testing, 16(1), 82–116.
Vermeer, A. (2000). Coming to grips with lexical richness in spontaneous speech data. Language Testing, 17(1), 65–83.
Wall, D., & Horák, T. (2006). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 1, the baseline study (TOEFL Monograph Series MS-34). Princeton, NJ: Educational Testing Service.
Wall, D., & Horák, T. (2008). The impact of changes in the TOEFL examination on teaching and learning in Central and Eastern Europe: Phase 2, coping with change (TOEFL iBT Research Series RR-08-37). Princeton, NJ: Educational Testing Service.
Xi, X., & Mollaun, P. (2006). Investigating the utility of analytic scoring for the TOEFL Academic Speaking Test (TAST) (TOEFL iBT Research Report TOEFLiBT-01). Princeton, NJ: Educational Testing Service.
