
HUMAN PERFORMANCE, 20(4), 415–432

Copyright 2007, Lawrence Erlbaum Associates, Inc.

When Traits Are Behaviors: The Relationship Between Behavioral
Responses and Trait-Based Overall Assessment Center Ratings
Duncan J. R. Jackson and Andrew R. Barney
Department of Management & International Business
Massey University Albany

Jennifer A. Stillman
School of Psychology
Massey University Albany

William Kirkley
Department of Management & International Business
Massey University Albany

Interest in exercise effects commonly observed in assessment centers (ACs) has resurfaced with Lance, Lambert, Gewin, Lievens, and Conway's (2004) study. The study
presented here addressed the construct validity puzzle associated with ACs by investigating whether traditional trait-based overall assessment ratings (OARs) could be
explained by behavioral performance on exercises. In a sample of 208 job applicants
from a real-world AC, it was found that the multivariate combination of scores from
three behavioral checklists explained around 90% (p < .001) of the variance in supposedly trait-based OARs. This study adds to the AC literature by suggesting that
traditional OARs are predictive of work outcomes because they reflect exercise-specific behavioral performance rather than trait-based assessments. If this is the case,
validity and efficiency are best served by abandoning redundant trait ratings (dimensions) in favor of more direct behavioral ratings.

Correspondence should be sent to Duncan J. R. Jackson, Department of Management & International Business, Massey University Albany, Private Bag 102904, North Shore MSC, Auckland, New
Zealand. E-mail: d.j.r.jackson@massey.ac.nz


Assessment centers (ACs) have been a popular technique among human-resource management professionals since their introduction into the organizational arena in the 1960s (Thornton, 1992). Research into AC ratings has shown that presumed method-driven variance constitutes a notable and prevalent quandary for this technique (Sackett & Dreher, 1982). In the AC literature, dimensions, expressed in multitrait–multimethod (MTMM) terms, are expected to reveal themselves across a set of exercises. For a given individual, the common expectation is approximate consistency
in the manifestation of dimensions across exercises. These manifestations should,
theoretically, form the basis of distinctions drawn between individuals for employment decisions. Instead, ratings from ACs typically return correlations among different traits within exercises (heterotrait–monomethod [HTMM] correlations) that
are substantially larger than correlations among the same traits observed across exercises (monotrait–heteromethod [MTHM] correlations; Lance, Lambert, Gewin,
Lievens, & Conway, 2004).
Traditionally, the so-called exercise effect just described is considered a manifestation of error to be minimized, and this remains the view of many researchers
(for a review, see Jackson, Stillman, & Atkins, 2005). This perspective is puzzling
for two reasons. First, despite the problem of exercise effects having been first identified
in 1982 (Sackett & Dreher, 1982), attempts to design ACs to minimize or eliminate
this supposed error have largely failed. Notably, Joyce, Thayer, and Pond (1994)
explored this issue by looking at the validity of person- versus task-oriented dimensions. Although this analysis bordered on a task-based procedure, MTMM expectations were drawn and a lack of construct validity was found in their study. A
recent article (Lance et al., 2004) concluded that HTMM correlations are usually
(much) stronger than MTHM correlations across a range of different ACs in current use. Clearly, method-driven variance is an integral part of ACs. Given the dominance of this method-driven variance, it is all the more surprising that
overall ratings in ACs have been repeatedly shown to be predictive of future work
performance (Woehr & Arthur, 2003). It seems unlikely that error should be so
consistently predictive of work outcomes. Second, in the broader field of recruitment and selection, work sample testing has also been shown to be predictive of future work performance (Schmidt & Hunter, 1998). Could it be that method-driven
variance in ACs is not error but a true measure of situationally specific, job-relevant performance?
One way to test for the importance of situationally specific assessments in overall AC ratings would be to analyze the specific behaviors underpinning traditional
AC dimensions. These behavioral indicators could then be considered in terms of
their relationship with traditional, dimension-based overall AC ratings. Although this approach seems logical in the wider context, behavioral responses in ACs are conventionally downplayed or merely recorded using checklists (e.g., Lievens, 1998), but not formally analyzed or used as decision-making criteria. Rather, it is the trait-based dimensions inferred from these behaviors that are
used as the basis for employment decisions. A reasonable take on the literature
suggests that there have been few analyses of these exercise-specific behaviors,
and the focus has been primarily on dimensions (see Jackson et al., 2005; Russell
& Domm, 1995).

BACKGROUND
Lance et al. (2004) reemphasized long-held concerns associated with the measurement properties of ACs. They also aimed to minimize various biases that can result
from specifying particular parameter constraints in structural models. In a
reanalysis of Lievens and Conway's (2001) large-scale review of AC measurement
properties, and contrary to the suggestions in that article, Lance et al. concluded
that exercise effects are prominent in ACs. In many ways, the arguments presented
by Lance et al. have taken the AC measurement debate full circle, bringing attention back to concerns that were raised more than 20 years ago on method-driven
variance outweighing any dimension variance that was intended for assessment in
an AC (Sackett & Dreher, 1982).
The legacy of AC research over the last 2 decades, with respect to construct validity, has been an unsatisfactory and unresolved position for both academics and
practitioners. Construct validity issues hold clear implications for practice in, for
example, developmental ACs, where feedback may be provided on the basis of
constructs for which there is no psychometric justification. Moreover, selecting
people on the basis of nondefensible constructs has clear implications for fairness,
accuracy, and efforts to both understand and improve an organization's assessment
procedures (Gatewood & Feild, 2001).

ASSESSMENT CENTERS PROBABLY ASSESS CONSTRUCTS OF SORTS
As Lance et al. (2004) and Sackett and Dreher (1982) pointed out, it is likely that
ACs assess constructs of sorts because they have been shown to predict organizational outcomes, and often convincingly so (Woehr & Arthur, 2003). The focal issue, however, is that there appears to be no consensus yet on which constructs are
being tapped through this technique. According to Lance et al., the current evidence points away from the measurement of trait-based dimensions, assumed to
hold stability across simulation exercises. An alternative perspective suggested by
Lowry (1995, 1997); Lance et al. (2004); and Lance et al. (2000) focused instead
on the effects of exercises, treating these as units of measurement.
Such an approach requires a shift in thinking for many studies examining the
psychometric properties of ACs, because method variance is often considered indicative of error. It may be, however, that AC assessments cannot be regarded in traditional psychological terms. Lowry (1997) defined two types of AC: the first is the traditional dimension-specific AC, which aims to return MTHM (i.e., stable trait) scores and for which construct validity issues arise. The alternative
is the task-specific AC, which treats each exercise as a stand-alone measure of
work-related behavioral performance. The latter abandons the use of trait-based
dimensions altogether.

THEORY AND RESEARCH INTO TASK-SPECIFIC ACs


The theoretical framework, under the task-specific approach, functions in a similar
manner to work samples. Behavior in work samples is inextricably linked to the effects of the situation, and situational characteristics are usually designed in such a
manner that they resemble, to varying degrees of fidelity, an actual job (Robertson
& Kandola, 1982). As such, the unit of measurement in a work sample is a behavioral response anchored to the effects of a work-relevant situation. Lance et al.
(2004) stated that "although task-specific ACs have been presented in theoretical terms, there is otherwise little or no additional evidence on their reliability or validity" (p. 383). They suggested that task-specific ACs present an important area
for research, particularly given the method-driven variance found to dominate the
AC literature, and the manner in which task-specific ACs place importance and
value on situational influences.
Although evidence on task-specific ACs is rare, that which has been reported is
encouraging. Notably, Lowry (1997) presented some preliminary evidence in favor of the approach, primarily in a narrative sense. More recently, Jackson et al.
(2005) presented empirical support for task-specific ACs, which showed evidence
of acceptable structural characteristics and reliability. In a repeated measures
study, the authors went further to show how task-specific and dimension-specific
ACs appeared to hold similar psychometric characteristics. Specifically, both approaches returned factors that resembled performance on exercises. Although exercise performance is acceptable in task-specific ACs, it is conceptually problematic for dimension-specific ACs. Jackson et al. suggested that dimension-specific
ACs may actually be task-specific ACs in disguise.

OVERALL ASSESSMENT RATINGS


Underlying conceptual and construct validity issues aside, in practice, it is overall
assessment ratings (OARs) that are often used as the basis for employment decisions when using ACs (Arthur, Day, McNelly, & Edens, 2003; Gaugler, Rosenthal,
Thornton, & Bentson, 1987). It is recommended under the current guidelines (International Task Force on Assessment Center Guidelines, 2000) that data from an AC be pooled in some way, either mechanically (e.g., calculating an average) or through a group discussion.

FIGURE 1  Assessment steps, chronologically, during an assessment center.
The process that is traditionally followed to arrive at an OAR is intricate and
commonly involves a series of inferences and aggregations. The approach that is
typically prescribed for dimension-specific ACs (e.g., see Ballantyne & Povah,
2004) can be summarized into three main steps. These are depicted in Figure 1.
The first step is to observe and record behavior within a set of simulation exercises.
Researchers from both the task-specific and dimension-specific schools of thought
recommend using behavioral checklists at this stage (Ahmed, Payne, & Whiddett,
1997; Lievens, 1998; Lowry, 1997). The second step involves inferring trait-based
dimensions from behavior that has been observed and recorded. This information,
and that derived from the following step, is often used for making employment decisions (Spychalski, Quiñones, Gaugler, & Pohley, 1997). The third, and usually
final, step is the adoption of an OAR. This is achieved through either mechanical
integration or a discussion held by assessors. Research suggests that no meaningful difference is observed between the predictive validity of OARs that are derived
either mechanically or by way of discussion (Pynes & Bernardin, 1992; Pynes,
Bernardin, Benton, & McEvoy, 1988).
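
The mechanical route can be made concrete with a small sketch. The Python example below, using hypothetical ratings and dimension names, averages post-exercise dimension ratings into a single OAR; it is purely illustrative, since the applied AC described later in this article pooled ratings through assessor discussion.

import pandas as pd

# Hypothetical post-exercise dimension ratings for one candidate:
# rows are exercises, columns are dimensions, values on a 1-6 scale.
ratings = pd.DataFrame(
    {"teamwork": [4, 5, 4], "customer_focus": [5, 5, 4],
     "oral_expression": [4, 4, 5], "tolerance": [5, 4, 4],
     "comprehension": [4, 5, 5]},
    index=["exercise_1", "exercise_2", "exercise_3"],
)

dimension_scores = ratings.mean(axis=0)   # pool each dimension across exercises
oar = dimension_scores.mean()             # mechanical overall assessment rating
print(dimension_scores.round(2))
print(f"OAR = {oar:.2f}")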

A PREDICTIVE VALIDITY CONUNDRUM


Turning back to our consideration of construct validity, the major conundrum for
researchers and practitioners in ACs has stemmed from the following question: If,
due to a lack of construct validity in dimension-specific ACs, we do not know what
they measure, then how is it that OARs are predictive of work outcomes?


Research over several decades has shown that ACs, and more specifically
OARs, display evidence of predictive validity (Arthur et al., 2003; Gaugler et al.,
1987; Schmidt & Hunter, 1998). The reason why they hold predictive properties
remains unexplained in light of Lance et al.'s (2004) study. We hypothesize that if
a task-specific interpretation is substantiated, behavioral indicators of performance on exercises, observed at Step 1 (see Figure 1), should account for a meaningful amount of variance in Step 3, the derivation of the trait-based OAR. The extent of this explained variation between behavioral performance and the OAR will
inform on the necessity and importance of the inferences at Step 2, that is, the necessity to infer traits from observed behaviors in ACs.

METHOD
The AC designed for Jackson et al. (2005) was used with a different pool of job applicants (i.e., new data were used in this study). Details of the AC are publicly
available and are, therefore, described only briefly next. The major point of difference in the study presented here was the focus on OARs and what defines these
variables.
Participants
A total of 208 job applicants for a private sector retail chain participated. Jobs applied for were in customer service and general sales. Around 87% of the sample
was female, and the mean age across all participants was 30.1 years (SD = 15.2).
Most participants described themselves as Caucasian (77%) and as having some
high school education (68%). Nine assessors participated in this AC. All participating assessors were managers, as is usually the case in ACs (Lowry, 1996;
Spychalski et al., 1997), who had a minimum of 2 years' experience in the positions
under scrutiny and were not formally trained in psychology.
AC Construction
The AC was designed according to a two-tiered process. Tier 1 involved the development of task-specific rating criteria, and Tier 2 involved the development of dimension-specific rating criteria.

Tier 1. Task-specific behavioral checklists. Using task-analysis procedures described in Lowry (1997), inductive job analysis interviews were used to
determine job-relevant tasks, which were, in turn, incorporated into a deductive
task-analysis questionnaire. The tasks rated as most important for selection were
retained for behavioral checklists used in the AC exercises. Performance on exercises was rated on a scale ranging from 1 (certainly below standard) to 6 (certainly
above standard). An example of an item on a behavioral checklist was "Speaks clearly and enunciates appropriately." These behavioral checklist items differ
from the trait-based assessment that follows for two fundamental reasons. First,
these judgments were only considered within exercises, and as such no behavioral
patterning across exercises was sought at this level of assessment. Second, and related to the first point, trait-level inferences were not drawn in the checklists. They
were intended as job-relevant situationally specific descriptors only, and no inferences were made beyond this stance.

Tier 2. Dimension-specific rating criteria. A deductive job analysis instrument known as the threshold traits analysis (Lopez, Kesselman, & Lopez,
1981) was used to guide the development of dimensions used in this AC. Assessors
were instructed to use behavioral evidence, including any notes and behavioral
checklist information, to guide their assessment and inference of global dimensions (as suggested by Ahmed et al., 1997; see Lievens, 1998, for dimension-specific ACs). Note that, contrary to the Tier 1 behavioral assessment, trait inferences
were drawn at this level, including the inference of consistency of behavior across
exercises. In line with the parsimonious number suggested by Lievens and Klimoski (2001), the following five dimensions were assessed:
1. Teamwork. The extent to which the individual works effectively and harmoniously with other team members.
2. Customer focus. The extent to which the individual is concerned with customer needs, describes products accurately, matches presentations to the
customer's interests, and attempts to assist customers to make satisfactory
purchases.
3. Oral expression. The extent to which the individual speaks grammatically
and clearly using appropriate language and appropriate gestures.
4. Tolerance. The extent to which the individual interacts effectively with
people despite delicate, frustrating, or tense situations that demand understanding, patience, and empathy.
5. Comprehension. The extent to which the individual understands spoken
and written, verbal, or behavioral language.

Simulation exercises. Exercise development was guided by the job analysis


information outlined previously, and in consultation with subject matter experts, to
help ensure relevance and content validity. Three exercises were designed in total,
in such a way that all dimensions were assessed in all exercises to produce a fully
crossed model. Each was of a similar format, specifically, a job-relevant group discussion, so as to minimize the effect of exercises and to encourage a dimension-specific assessment. The exercises covered aspects of the roles being assessed, including approaching customers, closing sales, and protocols associated with the return of goods.

Assessor training. Frame-of-reference training was used as a standard-setting procedure for assessors, as suggested by numerous researchers in the field
(Bernardin & Buckley, 1981; Lievens, 1998; Schleicher, Day, Mayes, & Riggio,
2002). Training was focused on key issues suggested in the literature as maximizing conditions for dimension measurement (e.g., Lievens, 1998). Assessors were
also trained on the task-specific component of the AC, that is, the behavioral
checklists, in a similar manner.
Derivation of overall assessment ratings. The OAR used in this applied
AC was based on assessor panel judgment across dimensions, whereby assessors
gathered subsequent to the AC to discuss overall performance of participants with
respect to traits. Overall ratings are commonly pooled in this manner (Ballantyne
& Povah, 2004; International Task Force on Assessment Center Guidelines, 2000).
Procedure
The literature suggests that rating performance just after exercises (within-exercise
approach) versus waiting until the end of the AC and rating across dimensions
(within-dimension approach) adds little to the facilitation of construct validity
(Harris, Becker, & Smith, 1993; Silverman, Dalessio, Woods, & Johnson, 1986).
Thus, a within-exercise approach to rating was followed in the interests of maintaining assessor recollections of the performance of candidates on exercises. The
task-specific rating on behavioral checklists naturally preceded the rating on dimensions, as behavioral assessment is logically prior to the evaluation of
dimensions.
The intention in this study was to retain a data-driven approach to the analysis
of the ratings obtained. As such, oblique factor analyses were used to identify general patterns in the task and dimension-specific assessment steps. For the task-specific step, scores on behavior within exercises were developed by aggregating
scores on each behavioral checklist associated with each exercise (a total of three
exercise scores). The focus of this study was to investigate the relationship between behavioral checklist scores and trait-based OARs. Multiple regression coefficients were used to explore this relationship.

RESULTS
Two sets of results were pertinent to this study. First, as background, the underlying structure of the data was explored. Second, and representing the primary aim of

WHEN TRAITS ARE BEHAVIORS

423

this study, the relationship between behavioral checklist scores and trait-based
OARs was investigated.

Underlying Structure of the Data


A principal axis factor analysis was run for the behavioral checklists (see Table 1),
and, for the dimension assessment, an MTMM matrix (Table 2) and its associated factor analysis (Table 3) are presented. Oblique rotation was used to allow for
interexercise correlations commonly found in the literature (see Lance, Noble, &
Scullen, 2002, p. 231). Note that loadings less than .30 were suppressed for clarity.
For both analyses, evidence of factorability was reassuring (see Hair, Anderson,
Tatham, & Black, 1998, for specific criteria) for behavioral checklists (KMO = .92, Bartlett's χ2 = 5811.19, df = 435, p < .001) and dimensions (KMO = .90, Bartlett's χ2 = 2409.95, df = 105, p < .001). Most of the communality estimates (h2)
shown in Tables 1 and 3 were supportive in terms of evidence for factorability. The
factors shown in Tables 1 and 3 for both approaches were clearly interpretable as
being indicative of exercise-driven variance. Similarly, exercise-specific variance
was displayed for the behavioral checklists with each item loading notably on to
one of the three exercises included in the analysis. Cumulative variance explained
for the behavioral and dimension-based factor solutions appeared reasonable. The
MTMM matrix returned HTMM correlations that were generally higher (median r =
.61) than MTHM correlations (median r = .34).
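
A minimal sketch of this analysis, assuming the third-party Python package factor_analyzer (its calculate_kmo, calculate_bartlett_sphericity, and FactorAnalyzer interfaces) and the hypothetical data layout from the earlier sketch, is given below; it mirrors the principal axis extraction with direct oblimin rotation described above.

import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Hypothetical frame: one row per applicant, 30 behavioral checklist items.
items = pd.read_csv("checklist_items.csv")

# Factorability evidence of the kind reported in the text (KMO and Bartlett's test).
chi_square, p_value = calculate_bartlett_sphericity(items)
_, kmo_overall = calculate_kmo(items)
print(f"Bartlett's chi2 = {chi_square:.2f}, p = {p_value:.3f}; KMO = {kmo_overall:.2f}")

# Principal axis factoring with direct oblimin rotation and three factors
# (one per exercise, following the scree criterion used in the article).
fa = FactorAnalyzer(n_factors=3, rotation="oblimin", method="principal")
fa.fit(items)

loadings = pd.DataFrame(fa.loadings_, index=items.columns)
print(loadings.where(loadings.abs() >= .30).round(2))   # suppress loadings < .30
print(pd.Series(fa.get_communalities(), index=items.columns).round(2))  # h2
phi = getattr(fa, "phi_", None)   # factor intercorrelations under oblique rotation
print(phi)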

The Relationship Between Behavioral Scores and Trait-Based OARs
Multiple linear regression was used to investigate the relationship between behavioral scores and trait-based OARs. Aggregated scores from the three behavioral
checklists were entered as predictors and the dimension-based (trait-based) OARs
were entered as the dependent variable (DV). The overall model was significant,
F(3, 204) = 654.33, p < .001. Moreover, the effect size for the overall model was
found to be substantial (adjusted R2 = .90). Standardized betas for the three behavioral checklists all had a statistically significant effect on the OAR (p < .001 for all predictors). The betas were comparable for the checklists associated with Exercises 1 (β = .37) and 2 (β = .33), and the beta associated with Exercise 3 was slightly higher (β = .51). Practitioners might be interested in using average behavioral checklist scores in the future to represent overall AC performance. As
such, the bivariate correlation between average checklist performance (the overall
exercise rating, or OER) and trait-based OARs was computed. The correlation between OERs and OARs was found to be substantial, to the point of suggesting that
they are one and the same (r = .95, p < .001).
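
As a hedged illustration of this analysis, the sketch below uses Python with statsmodels and the same hypothetical column layout as the earlier sketches (checklist items named ex1_b1 through ex3_b10 and the trait-based OAR in a column named oar); it reproduces the form of the regression, the standardized betas, and the OER-OAR correlation.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical wide-format data: 30 checklist columns plus the trait-based OAR.
df = pd.read_csv("ac_ratings.csv")
X = pd.DataFrame({f"ex{i}_score": df.filter(like=f"ex{i}_b").mean(axis=1)
                  for i in (1, 2, 3)})
y = df["oar"]

model = sm.OLS(y, sm.add_constant(X)).fit()
print(f"adjusted R2 = {model.rsquared_adj:.2f}")         # about .90 in this study

# Standardized betas: refit with z-scored criterion and predictors.
z = lambda s: (s - s.mean()) / s.std(ddof=1)
betas = sm.OLS(z(y), sm.add_constant(X.apply(z))).fit().params.drop("const")
print(betas.round(2))

# Overall exercise rating (OER): the average checklist score across exercises.
oer = X.mean(axis=1)
print(f"r(OER, OAR) = {np.corrcoef(oer, y)[0, 1]:.2f}")  # .95 reported above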

TABLE 1
Factor Analysis of Behavioral Checklists

Item          Factor 1   Factor 2   Factor 3     h2       M      SD
Ex1b1            .76                             .60     4.99    0.84
Ex1b2            .81                             .60     4.92    0.86
Ex1b3            .76                             .52     4.68    0.90
Ex1b4            .65                             .45     5.02    0.88
Ex1b5            .77                             .68     4.55    1.01
Ex1b6            .78                             .64     4.78    1.00
Ex1b7            .80                             .63     4.33    1.04
Ex1b8            .78                             .65     4.90    1.07
Ex1b9            .72                             .60     4.77    1.02
Ex1b10           .63                             .44     5.03    0.85
Ex3b1                       .60                  .50     4.70    0.95
Ex3b2                       .71                  .58     4.52    1.02
Ex3b3                       .66                  .52     4.78    0.87
Ex3b4                       .74                  .65     4.86    0.99
Ex3b5                       .75                  .62     4.44    0.98
Ex3b6                       .68                  .58     4.91    0.95
Ex3b7                       .79                  .61     4.81    0.94
Ex3b8                       .84                  .59     4.89    0.87
Ex3b9                       .60                  .42     5.08    0.86
Ex3b10                      .82                  .60     5.03    0.84
Ex2b1                                  .91       .81     4.42    1.11
Ex2b2                                  .92       .82     4.33    1.10
Ex2b3                                  .79       .62     4.32    1.13
Ex2b4                                  .86       .79     4.38    1.12
Ex2b5                                  .87       .71     4.33    1.04
Ex2b6                                  .93       .79     4.21    1.21
Ex2b7                                  .91       .80     3.96    1.13
Ex2b8                                  .81       .72     4.07    1.20
Ex2b9                                  .70       .73     4.55    1.22
Ex2b10                                 .77       .72     4.51    1.13
SS              8.24       9.47       8.46
Cumulative %   40.40      54.12      63.33

Note. Principal axis factor analysis with direct oblimin rotation and a scree criterion. Loadings less than .30 are suppressed. Factor correlations: 1 and 2 = .36, 1 and 3 = .46, 2 and 3 = .43. Ex1 through Ex3 indicate the simulation exercises used in this study. h2 = communality estimates upon extraction; SS = rotated sums of squared loadings; % = cumulative percentage of variance explained.


TABLE 2
Multitrait–Multimethod Matrix for Assessment Dimensions and Exercises

        Ex1D1 Ex1D2 Ex1D3 Ex1D4 Ex1D5 Ex2D1 Ex2D2 Ex2D3 Ex2D4 Ex2D5 Ex3D1 Ex3D2 Ex3D3 Ex3D4
Ex1D2    .64
Ex1D3    .49   .65
Ex1D4    .57   .71   .81
Ex1D5    .55   .62   .61   .63
Ex2D1    .38   .42   .38   .38   .31
Ex2D2    .32   .41   .38   .34   .26   .62
Ex2D3    .17   .25   .33   .28   .18   .48   .54
Ex2D4    .32   .38   .45   .43   .33   .55   .65   .71
Ex2D5    .25   .30   .39   .37   .34   .56   .55   .53   .71
Ex3D1    .20   .29   .24   .29   .26   .27   .35   .29   .38   .42
Ex3D2    .30   .40   .34   .40   .34   .39   .45   .34   .43   .44   .78
Ex3D3    .12   .24   .13   .21   .19   .23   .34   .34   .38   .41   .77   .69
Ex3D4    .22   .34   .29   .35   .32   .35   .38   .33   .41   .45   .88   .81   .79
Ex3D5    .20   .29   .23   .29   .28   .29   .36   .35   .45   .46   .84   .73   .80   .84

Note. All correlations were significant (p < .05). Monotrait–heteromethod correlations (the same dimension rated in different exercises) had a median of .34 (interquartile range [IQR] = .13); heterotrait–monomethod correlations (different dimensions rated within the same exercise) had a median of .61 (IQR = .09). D = dimension; D1 = comprehension; D2 = oral expression; D3 = tolerance; D4 = teamwork; D5 = customer focus; Ex = exercise.


TABLE 3
Factor Analysis of Trait-Based Assessment

Item          Factor 1   Factor 2   Factor 3     h2       M      SD
Ex1D1            .63                             .48     4.99    0.79
Ex1D2            .72                             .70     4.76    1.06
Ex1D3            .81                             .68     4.74    0.89
Ex1D4            .88                             .77     4.57    1.02
Ex1D5            .68                             .56     5.05    0.84
Ex2D1                       .70                  .50     4.93    0.90
Ex2D2                       .82                  .58     4.78    0.98
Ex2D3                       .77                  .57     4.82    0.88
Ex2D4                       .86                  .79     4.63    0.95
Ex2D5                       .76                  .58     4.82    0.94
Ex3D1                                  .95       .86     4.43    1.15
Ex3D2                                  .76       .73     4.41    1.18
Ex3D3                                  .87       .74     4.45    1.02
Ex3D4                                  .93       .90     4.12    1.17
Ex3D5                                  .89       .82     4.50    1.11
SS              5.22       4.57       4.91
%              44.69      59.71      68.37

Note. Principal axis factor analysis with direct oblimin rotation and a scree criterion. Loadings less than .30 are suppressed. Factor correlations: 1 and 2 = .34, 1 and 3 = .50, 2 and 3 = .50. h2 = communality estimates upon extraction; D = dimension; D1 = comprehension; D2 = oral expression; D3 = tolerance; D4 = teamwork; D5 = customer focus; Ex = exercise; SS = rotated sums of squared loadings; % = cumulative percentage of variance explained.

DISCUSSION
In this AC study, scores on behavioral checklists explained most of the variance in
OARs that were supposedly based on trait inferences (adjusted R2 = .90). Taken
with the exercise-based structure of ratings suggested in Tables 1, 2, and 3, the results of this study suggest that OARs may have held predictive validity all along
because they reflect cross-situationally specific performance. If OARs actually reflect situationally specific performance, then drawing trait inferences in AC exercises may be pointless and misleading. In the overview of this article, it was explained that exercise effects have traditionally been viewed as error, confounding
the supposed true variance among the measured dimensions. However, two decades of AC development have not eliminated presumed method variance from
this technique. Exercise-based variance was recently shown to underpin ratings
across a wide spectrum of ACs (Lance et al., 2004). Nevertheless, OARs derived
from ACs have repeatedly been shown to be predictive of future work performance (Arthur et al., 2003). It seems paradoxical (at least) that such an apparently error-ridden instrument should yield any sort of predictive utility.
Defining or Redefining Assessment Centers
One solution to this puzzle may be that exercise effects are not error, as conventionally believed, but are simply indicative of situationally driven behavioral performance.
Given the research into situational effects on human behavior (Mischel, 1984), this
supposed measurement error may, from an alternative perspective, contribute to the
predictive qualities of OARs. To test this hypothesis, we analyzed the relationship
between situationally specific behavioral performance, as measured by behavioral
checklists, and traditional OARs, based on trait inferences. Note that the behavioral checklists represented within-exercise behavioral performance, recorded on a separate checklist for each exercise, whereas, in line with the traditional approach to ACs, OARs were intended to summarize trait levels across exercises. The behavioral and trait information was
set up to represent distinct steps in the rating process (see Figure 1). In Step 1, performance was observed and recorded by exercise. In Step 2, and on a separate rating instrument, traits were inferred from behavioral observations. In turn, OARs were developed by discussing trait-based dimensions. Thus, under strict trait theory, the
inferences drawn at Steps 2 and 3 represent a search for cross-situationally stable behavior. In this study, the exercises were quite similar in format in an effort to minimize method variance, yet HTMM correlations were still stronger than MTHM correlations in the dimension assessment. In the traditional approach to ACs, Step 1, the
behavioral information, is secondary to the trait inferences. We argue that this should
not be the case as OARs may only ever reflect behavioral responses anyway. As such,
Steps 2 and 3 in Figure 1 may be redundant.
The finding that supposedly trait-based OARs may actually be based on
cross-situationally specific performance raises two questions. First, can measurements of behavior lead to accurate predictors of future work performance, and second, what value is added by making conventional trait inferences based on these
behaviors?
The first of these questions is easiest to answer. As previously discussed,
work-sample testing has been repeatedly shown to be highly predictive of future
work performance (Schmidt & Hunter, 1998) through consistency of behavioral
responses in similar situations (Wernimont & Campbell, 1968). If ACs are purposely designed as collections of work samples, then they may well share the high
predictive validity associated with work samples. The second question is harder to
answer. Conventionally, the measurement of dimensions is seen as more commercially desirable than the measurement of situation-specific behavior because it,
supposedly, allows behavioral generalization (see Cook's, 2004, description of the
specificity of work samples). If, as the findings of this study suggest, ACs are collections of situationally specific exercises, then their generality has been established through the meta-analyses already in publication (Arthur et al., 2003; Gaugler et al., 1987). It is important to note, however, that under a task-specific approach, the interaction effects observed between people and exercises, rather than the inference of global traits, need to be given credit for the predictive qualities of ACs. If the interaction between people and situations determines the predictive validity of ACs, then we suggest that the choice of exercise might be more important
than originally thought. Work samples that are conceptually closer to the requirements of job performance are likely to be preferable to low fidelity situational
exercises.
An alternative, exercise-based approach may lead some to question the definition of an AC. Can a collection of work samples or situational exercises still be defined as an AC? The international guidelines on ACs provide the following definition: An assessment center consists of a standardized evaluation of behavior
based on multiple inputs. Several trained observers and techniques are used. Judgments about behavior are made, in major part, from specifically developed assessment simulations (International Task Force on Assessment Center Guidelines,
2000, p. 319). It appears that nothing within this fundamental definition is contrary
to the ideas expressed under the task-specific approach. It would also seem that
treating exercises as work samples would allow for a more holistic perspective on
human performance, particularly if psychological testing is included in the procedure. With a richer assessment from behavioral measures (based on performance
on exercises) and traits (based exclusively on psychological assessments, e.g., personality), conditions for predictive validity are likely to be optimized.
An Alternative Method of Developing ACs
The intention to measure trait-based dimensions in ACs is common (Sackett &
Dreher, 1982) yet seldom realized empirically (Lance et al., 2004). Jackson et al.
(2005) commented that quite strict trait-oriented expectations are often drawn in
ACs. The major emerging alternative to the traditional dimension-specific approach is the task-specific AC (Lowry, 1997). The course of action set out by Lowry for the development of task-specific ACs is largely the same as the procedures often described for dimension-specific ACs, but it holds one key difference.
In his suggestions for the development of a task-specific approach, Lowry
(1997) proposed that methods of job analysis are important as the foundation of
ACs. This is also argued consistently elsewhere for the dimension-specific process
(see Ballantyne & Povah, 2004; Woodruffe, 1993). Lowry emphasized the importance of the job relevance of exercises, the training of assessors, and the use of
behavioral checklists. Again, such recommendations are often made for the traditional dimension-specific approach (Ahmed et al., 1997; Lievens, 1998). The major difference, however, between task and dimension-specific ACs is the omission
of trait inferences in the former approach. As Lowry stated, "exercises (work samples) and not performance dimensions should be used as the basis for assessor scores" (p. 54). The results of this study suggest that average checklist scores
(OERs) are so closely related to OARs anyway (r = .95, p < .001) that trait-based
inferences drawn from exercise performance may be redundant.
A Step Toward Solving an Old Puzzle
The exercise effect has possibly been one of the most enduring empirical findings
throughout the AC literature, dating back to when it was first identified in Sackett
and Dreher's (1982) study. However, the issue has never reached an acceptable
conclusion, despite numerous attempts to resolve it and to encourage ACs and assessors to rate people on trait-based dimensions. Although Lievens and Conway's (2001) review provided initial hope for dimension-specific ACs, Lance et al.'s (2004) reanalysis of their data reaffirmed the prominence of exercise effects.
The perplexing question surrounding the AC measurement debate has centered
on why ACs are predictive of important work-related criteria. If evidence suggests
that ACs do not measure dimensions, then what is it about the process that makes it
useful for making employment decisions? OARs are the most commonly used
scores from ACs for the purpose of investigating predictive validity (Arthur et al.,
2003; Gaugler et al., 1987; Schmidt & Hunter, 1998). This study makes a significant contribution to the literature by informing on this apparent paradox. It was
found, in this study, that OARs most likely reflect situationally specific behavioral
performance rather than the commonly intended trait measures. The implication
here is that traditional OARs have been found to be predictive in the past because
they are likely to reflect behavioral performance on specific exercises.
Implications for Practice and Conclusions
The results of this study suggest that task-specific ACs (defined by Step 1 in Figure
1) returned a conceptually acceptable factor solution (see Table 1). The reliability
of ratings was high, with coefficients alpha in excess of .90 for each behavioral
checklist. Fewer inferences were drawn in the task-specific step of the assessment,
thus lowering the potential for error and saving time. For human resource professionals, the task-specific design may present an effective alternative to more traditional AC architecture.
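
For practitioners wishing to verify this reliability property in their own data, coefficient alpha can be computed directly from the item ratings of each checklist; the sketch below (Python, with the hypothetical column naming used in the earlier sketches) is one minimal way to do so.

import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Coefficient alpha for one checklist (rows = applicants, columns = items)."""
    k = items.shape[1]
    item_variance_sum = items.var(axis=0, ddof=1).sum()
    total_score_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variance_sum / total_score_variance)

# Hypothetical layout: the ten items of the Exercise 1 checklist.
exercise1_items = pd.read_csv("ac_ratings.csv").filter(like="ex1_b")
print(round(cronbach_alpha(exercise1_items), 2))   # values above .90 were observed here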
The results of this study provide new directions, not only for research but also
for practice. For any company wishing to use ACs, the focus may be on using a
technique that is able to predict work outcomes. Although we have known for
some time that ACs are useful predictors of such outcomes, the reasons for this
have remained unclear. The suggestion of this study is that the usual predictive properties of the traditional OAR are likely to be retained if a summary of behavioral performance on exercises, as opposed to dimensional performance, is derived. A construct-valid direction for the development of overall ratings could be to average
behavioral performance across exercises and use OERs instead.
Organizations could design ACs in such a way that each exercise is scored individually with behavioral checklists that are specific to particular exercises. The traditional approach is to score trait-like variables across exercises, which has been
shown, more often than not, to return construct-invalid results. With a task-specific framework, human resource practitioners could expect construct-valid ratings coupled
with the usual predictive properties associated with ACs. Such an approach would
carry fewer inferences and would therefore be less complex, less time consuming,
and easier to train assessors on. This study also highlights the importance of the exercises included in ACs. Despite the similarity in format of each of the exercises used in
this study, clear exercise effects resulted. Given the link between behaviors and situations (exercises) in ACs, it is important that exercises bear job relevance.
The analyses reported in this article provide evidence for the idea, as suggested
previously by Lowry (1997), that behavior is inextricably linked to exercises in
ACs. For this reason, a shift from a dimension-specific to a task-specific approach
may lead to AC practices that are justifiable both conceptually and legally (Lowry,
1996; Norton, 1977). Task-specific ACs present a green field in terms of potential
research. As Lance et al. (2004) and Jackson et al. (2005) commented, more information is needed on a host of issues surrounding their use. The results of this study,
in line with others, add further suggestion that dimension labels should be dropped
in ACs. If OARs actually reflect cross-situationally specific performance, then the
use of dimensions as stable, trait-based measures is potentially misleading, inaccurate, and redundant. This issue creates an ethical dilemma, particularly in developmental ACs, where inaccurate trait-based measurement and feedback may have a
detrimental effect on participants. Assessment on the basis of exercise-specific behavioral output may be the key to resolving such issues.

REFERENCES
Ahmed, Y., Payne, T., & Whiddett, S. (1997). A process for assessment exercise design: A model of best practice. International Journal of Selection and Assessment, 5, 62–68.
Arthur, W., Day, E. A., McNelly, T. L., & Edens, P. S. (2003). A meta-analysis of the criterion-related validity of assessment center dimensions. Personnel Psychology, 56, 125–154.
Ballantyne, I., & Povah, N. (2004). Assessment and development centres (2nd ed.). Hampshire, UK: Gower.
Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Management Review, 6, 205–212.
Cook, M. (2004). Personnel selection: Adding value through people (4th ed.). Chichester, UK: Wiley.
Gatewood, R. D., & Feild, H. S. (2001). Human resource selection (5th ed.). Fort Worth, TX: Harcourt College.
Gaugler, B., Rosenthal, D., Thornton, G., & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72, 493–511.
Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1998). Multivariate data analysis (5th ed.). Upper Saddle River, NJ: Prentice-Hall.
Harris, M. M., Becker, A. S., & Smith, D. E. (1993). Does the assessment center scoring method affect the cross-situational consistency of ratings? Journal of Applied Psychology, 78, 675–678.
International Task Force on Assessment Center Guidelines. (2000). Guidelines and ethical considerations for assessment center operations. Public Personnel Management, 29, 315–331.
Jackson, D. J. R., Stillman, J. A., & Atkins, S. G. (2005). Rating tasks versus dimensions in assessment centers: A psychometric comparison. Human Performance, 18, 213–241.
Joyce, L. W., Thayer, P. W., & Pond, S. B. (1994). Managerial functions: An alternative to traditional assessment center dimensions. Personnel Psychology, 47, 109–121.
Lance, C. E., Lambert, T. A., Gewin, A. G., Lievens, F., & Conway, J. M. (2004). Revised estimates of dimension and exercise variance components in assessment center postexercise dimension ratings. Journal of Applied Psychology, 89, 377–385.
Lance, C. E., Newbolt, W. H., Gatewood, R. D., Foster, M. R., French, N. R., & Smith, D. (2000). Assessment center exercise factors represent cross-situational specificity, not method bias. Human Performance, 13, 323–353.
Lance, C. E., Noble, C. L., & Scullen, S. E. (2002). A critique of the correlated trait-correlated method and correlated uniqueness models for multitrait-multimethod data. Psychological Methods, 7, 228–244.
Lievens, F. (1998). Factors which improve the construct validity of assessment centers: A review. International Journal of Selection and Assessment, 6, 141–152.
Lievens, F., & Conway, J. M. (2001). Dimension and exercise variance in assessment center scores: A large-scale evaluation of multitrait–multimethod studies. Journal of Applied Psychology, 86, 1202–1222.
Lievens, F., & Klimoski, R. J. (2001). Understanding the assessment center process: Where are we now? In C. L. Cooper & I. T. Robertson (Eds.), International review of industrial and organizational psychology (Vol. 16, pp. 245–286). Chichester, UK: Wiley.
Lopez, F. M., Kesselman, G. A., & Lopez, F. E. (1981). An empirical test of a trait-oriented job analysis technique. Personnel Psychology, 34, 479–502.
Lowry, P. E. (1995). The assessment center process: Assessing leadership in the public sector. Public Personnel Management, 24, 443–450.
Lowry, P. E. (1996). A survey of the assessment center process in the public sector. Public Personnel Management, 25, 307–321.
Lowry, P. E. (1997). The assessment center process: New directions. Journal of Social Behavior and Personality, 12, 53–62.
Mischel, W. (1984). Convergences and challenges in the search for consistency. American Psychologist, 39, 351–364.
Norton, S. D. (1977). The empirical and content validity of assessment centers vs. traditional methods for predicting managerial success. Academy of Management Review, 2, 442–445.
Pynes, J., & Bernardin, H. J. (1992). Mechanical vs. consensus-derived assessment center ratings: A comparison of job performance validities. Public Personnel Management, 21, 17–28.
Pynes, J., Bernardin, H. J., Benton, A. L., & McEvoy, G. M. (1988). Should assessment center dimension ratings be mechanically-derived? Journal of Business and Psychology, 2, 217–227.
Robertson, I. T., & Kandola, R. S. (1982). Work sample tests: Validity, adverse impact and applicant reaction. Journal of Occupational Psychology, 55, 171–183.
Russell, C. J., & Domm, D. R. (1995). Two field tests of an explanation of assessment centre validity. Journal of Occupational and Organizational Psychology, 68, 25–47.


Sackett, P. R., & Dreher, G. F. (1982). Constructs and assessment center dimensions: Some troubling empirical findings. Journal of Applied Psychology, 67, 401–410.
Schleicher, D. J., Day, D. V., Mayes, B. T., & Riggio, R. E. (2002). A new frame for frame-of-reference training: Enhancing the construct validity of assessment centers. Journal of Applied Psychology, 87, 735–746.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274.
Silverman, W. H., Dalessio, A., Woods, S. B., & Johnson, R. L. (1986). Influence of assessment center methods on assessors' ratings. Personnel Psychology, 39, 565–578.
Spychalski, A. C., Quiñones, M. A., Gaugler, B. B., & Pohley, J. (1997). A survey of assessment center practices in organizations in the United States. Personnel Psychology, 50, 71–90.
Thornton, G. C., III. (1992). Assessment centers in human resource management. New York: Addison-Wesley.
Wernimont, P., & Campbell, J. (1968). Signs, samples, and criteria. Journal of Applied Psychology, 52, 372–376.
Woehr, D. J., & Arthur, W. (2003). The construct-related validity of assessment center ratings: A review and meta-analysis of the role of methodological factors. Journal of Management, 29, 231–258.
Woodruffe, C. (1993). Assessment centres: Identifying and developing competence (2nd ed.). London: Institute of Personnel Development.
