
Journal of Applied Psychology, 2005, Vol. 90, No. 3, 592–600

Copyright 2005 by the American Psychological Association 0021-9010/05/$12.00 DOI: 10.1037/0021-9010.90.3.592

Assessing Measurement Equivalence Across Rating Sources: A Multitrait–Multirater Approach


David J. Woehr
University of Tennessee

M. Kathleen Sheehan
Texas A&M University

Winston Bennett Jr.


Air Force Research Laboratory
Research examining the structure of multisource performance ratings has demonstrated that ratings are a direct function of both who is doing the rating (rating source) as well as what is being rated (performance dimension). A separate line of research has focused on the extent to which performance ratings are equivalent across sources. To date no research has examined the measurement equivalence of multisource ratings within the context of both dimension and rating source direct effects on ratings. We examine the impact of both performance dimension and rating source as well as the degree of measurement equivalence across sources. Results indicate that (a) the impact of the underlying performance dimension is the same across rating sources, (b) the impact of rating source is substantial and only slightly smaller than the impact of the underlying performance dimension, and (c) the impact of rating source differs substantially depending on the source.

Keywords: multisource performance ratings, measurement equivalence, performance appraisal, self–other agreement

A good deal of literature has focused on the degree of convergence or agreement across sources in multisource performance-rating systems (e.g., Conway & Huffcutt, 1997; Harris & Schaubroeck, 1988; Mabe & West, 1982; Viswesvaran, Ones, & Schmidt, 1996), with much of this literature indicating that individuals view their own performance quite differently than others view their performance (Ashford, 1989; Harris & Schaubroeck, 1988). A meta-analytic review of self- and other ratings of performance (Harris & Schaubroeck, 1988) revealed that self-ratings were only moderately correlated with ratings made by supervisors (r = .35 for self–supervisor agreement) and peers (r = .36 for self–peer agreement), whereas correlations between peer and supervisory ratings were substantially higher (r = .62). More recent research (e.g., Conway & Huffcutt, 1997; Mount, Judge, Scullen, Sytsma, & Hezlett, 1998; Viswesvaran et al., 1996) has supported this finding and has demonstrated higher levels of agreement between different supervisors rating the same individual than between different peers rating the same individual.

Despite findings of lack of agreement across rating sources, multisource feedback systems are becoming increasingly popular (Bracken, Timmreck, & Church, 2001). Proponents argue that cross-source rating differences represent the unique perspective on performance available to different sources.

David J. Woehr, Department of Management, University of Tennessee; M. Kathleen Sheehan, Department of Psychology, Texas A&M University; Winston Bennett Jr., Air Force Research Laboratory, Mesa, Arizona.

Correspondence concerning this article should be addressed to David J. Woehr, Department of Management, University of Tennessee, 408 Stokely Management Center, Knoxville, TN 37996. E-mail: djw@utk.edu

For example, individuals may act differently with subordinates than with supervisors, subordinates may be better situated than peers or supervisors to evaluate certain behaviors (e.g., delegation), and so forth. However, many of the assumptions on which multisource rating systems are predicated remain untested (Church & Bracken, 1997). One crucial and often overlooked assumption is that the rating instrument is measuring the same thing across different rating sources. If the rating instrument is not equivalent across sources, substantive interpretations of the ratings (including the extent to which different sources agree or disagree) and any practical uses or feedback based on the ratings may be inaccurate and misleading. As noted by Cheung (1999), multisource rating systems typically reveal differences among raters, and correctly identifying and understanding these differences is crucial for the effective use of performance information.

Given the importance of cross-source measurement equivalence, researchers have more recently begun to focus on evaluating the extent to which performance ratings are equivalent across rating sources. However, to date, all of this research has considered performance ratings to be a direct function of only the underlying performance dimension and unique variance (i.e., error). A separate line of research examining the structure of multisource performance ratings has demonstrated that ratings are a direct function of both who is doing the rating (rating source) as well as what is being rated (performance dimension). Our goal in this study is to extend previous research by examining measurement equivalence within the context of a model that more accurately describes multisource ratings as being a function of the underlying performance dimension and the rating source (as well as unique variance).


The most prevalent approach to evaluating measurement equivalence is through the use of confirmatory factor analysis (CFA; Cheung & Rensvold, 1999; Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000). Four previous studies have used CFA procedures to directly assess the measurement equivalence of performance ratings across sources (Cheung, 1999; Facteau & Craig, 2001; Lance & Bennett, 1997; Maurer, Raju, & Collins, 1998). Cheung (1999) used a CFA approach to examine the measurement equivalence of self- and supervisor ratings of a sample of midlevel managers on two performance dimensions (internal and external roles). Similarly, Maurer et al. (1998) used CFA to examine the equivalence of peer and subordinate ratings of a sample of managers on a single performance dimension (team building skill). Results of both studies generally indicated measurement equivalence across the two rating sources. Facteau and Craig (2001) also used a CFA approach for the analysis of self-, supervisor, subordinate, and peer performance ratings of a sample of managers on seven performance dimensions. Their results also indicated that ratings from the four rating sources were conceptually equivalent across the seven performance dimensions. Taken together, these studies appear to provide consistent support for the measurement equivalence of different rating sources. These results, however, stand in contrast to those of Lance and Bennett (1997). These authors analyzed self-, supervisor, and peer ratings of eight samples of U.S. Air Force airmen on one general performance factor (interpersonal proficiency). In five of the eight samples, results indicated that different rater sources had different conceptualizations of performance; that is, the samples lacked measurement equivalence.

In addition to the contrasting results presented above, an examination of all of the studies described above indicates a potentially serious limitation. This limitation stems from the model of multisource performance ratings used to evaluate measurement equivalence. Specifically, all four studies used a multiple-group CFA methodology that modeled performance ratings solely as a function of the underlying performance dimension (and unique variance or error) and then examined the moderating effect of rating source on rating item loadings on the performance dimension(s). Research on the structure of multisource ratings, however, suggests that the performance dimension is not the only systematic factor underlying ratings (e.g., Conway, 1996; Coovert, Craiger, & Teachout, 1997; Goffin & Jackson, 1992; Mount et al., 1998). Rather, the most common conceptual framework underlying this approach is the multitrait–multimethod (MTMM) design proposed by Campbell and Fiske (1959). In this design two or more traits are measured with two or more methods. When this approach is applied to rating data the traits typically represent performance dimensions, and the methods typically refer to different rating sources (thus, within this context MTMM becomes MTMR [multitrait–multirater]).

The majority of research examining the underlying structure of MTMR rating data has focused on the appropriateness of a linear CFA-based MTMM model for describing MTMR data (e.g., Conway, 1996; Coovert et al., 1997; Goffin & Jackson, 1992; Mount et al., 1998). This research suggests that despite various methodological limitations (cf.
Marsh, 1989; Marsh & Bailey, 1991), this model provides the best representation of MTMR data, both conceptually and empirically (Conway, 1996; Goffin & Jackson, 1992; Kenny & Kashy, 1992; Mount et al., 1998).

The conceptual appropriateness of the MTMR model is apparent when one considers the observed relationship among the performance dimensions. Specifically, the model suggests that all correlations between performance dimensions, regardless of the magnitude of the actual correlation, are augmented (i.e., inflated) by the same amount when assessed by the same rating source. From a traditional performance-rating perspective this effect is consistent with discussions of true halo and illusory halo (Cooper, 1981; Lance & Woehr, 1986) in that there is an implied tendency for raters to consistently inflate the correlations between performance dimensions (Goffin & Jackson, 1992). In essence, rating-source effects are viewed as systematic sources of error (i.e., method variance). Alternatively, rating-source effects may be viewed as valid performance-related effects. For example, source effects may represent differences in perspective, opportunity to observe, or other implicit source-specific components of performance (Lance, Baxter, & Mahan, in press; Lance, Teachout, & Donnelly, 1992; Murphy & Cleveland, 1995). The MTMR model is also consistent with this ecological perspective of rating-source effects (Lance & Woehr, 1989) in that source effects may be postulated to represent valid, systematic sources of performance information.

Regardless of whether source effects are viewed as systematic error or as valid performance information, the traditional MTMR approach suggests that performance ratings are a direct linear function of multiple systematic factors. Here it is important to emphasize that although the previous studies of cross-source measurement equivalence may have incorporated multiple performance dimensions and multiple rating sources, the underlying model analyzed did not include both performance dimension and rating source direct effects on performance ratings. Of concern here is that although the multiple groups approach allows for an examination of the direct effect of performance dimension and the moderating effect of rating source (i.e., Dimension × Source interaction), it does not permit an examination of the rating source main effect. Further, the exclusion of the rating source main effect may result in biased estimates of both the performance dimension and Dimension × Rating Source effects.

Specifically, the multiple groups approach is based on the analysis of a separate covariance matrix for each rating source. This matrix represents the interrelations among the performance ratings of the different performance dimensions within rating source. Multiple groups analysis separately evaluates each matrix with respect to a performance-dimension-based model and evaluates measurement equivalence by testing the impact of equality constraints across rating source groups. This approach is based on a limited subset of the data available. That is, by using separate covariance matrices for each rating source, interrelations between ratings of the same dimension(s) as well as between ratings of different dimensions by different sources are not included in the analysis. Thus, this approach is not based on the full data structure, and any relation among ratings not attributable to the underlying performance dimension(s) is not included in the model evaluated.
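In MTMR terms, this decomposition can be written explicitly. The notation below is ours rather than the authors', and it assumes completely standardized variables with the source factors specified as orthogonal to the dimension factors and to one another (a common identification choice for CFA-based MTMM models):

\[
r_{ds} \;=\; \lambda^{D}_{ds}\,\xi_{d} \;+\; \lambda^{S}_{ds}\,\eta_{s} \;+\; \varepsilon_{ds},
\]

where \(r_{ds}\) is the rating of performance dimension \(d\) by source \(s\), \(\xi_{d}\) is the latent dimension factor, \(\eta_{s}\) is the latent rating-source factor, and \(\varepsilon_{ds}\) is the unique component. The implied covariances are

\[
\operatorname{Cov}(r_{ds}, r_{d's}) = \lambda^{D}_{ds}\lambda^{D}_{d's}\,\phi_{dd'} + \lambda^{S}_{ds}\lambda^{S}_{d's}
\qquad \text{and} \qquad
\operatorname{Cov}(r_{ds}, r_{ds'}) = \lambda^{D}_{ds}\lambda^{D}_{ds'},
\]

for ratings from the same source and for ratings of the same dimension from different sources, respectively. Every within-source, between-dimension covariance is thus augmented by a source term (by the same amount whenever the source loadings are equal within a source), whereas an analysis of separate within-source covariance matrices models only covariances of the first kind and discards the cross-source covariances entirely.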
Consequently, previous research does not provide an indication of the degree of measurement equivalence across sources while simultaneously examining the direct impact of rating source on performance ratings (i.e., the interactive effect of rating source and performance dimension is examined without considering the main effect of rating source). Multisource ratings may demonstrate measurement equivalence with respect to the performance dimension


assessed, yet the dimension may account for only a small portion of the actual variance and/or covariance in ratings. It is possible that the majority of variance in performance ratings may be attributable to rating-source effects or other unique sources. If this is the case, then the actual equivalence across ratings may be dwarfed by other sources of disagreement.

As noted earlier, our goal in the present study is to extend the literature on cross-source measurement equivalence. First, we compare the fit of a model of multisource performance ratings as a function of an underlying performance dimension and unique effects with an MTMR-based model in which ratings are a function of a rating source factor as well as an underlying performance dimension and unique variance components. This comparison provides a test of the appropriateness of the MTMR model with respect to the performance rating data. We expected, on the basis of the research findings mentioned previously, that the MTMR-based model would provide a more appropriate model of multisource rating data. Second, we evaluate three MTMR-based latent variable models representing differing levels of measurement equivalence across rating sources. Finally, we examine the relative impact of rater source, performance dimension, and unique components on performance ratings.

Models of Measurement Equivalence


We evaluate three MTMR-based models of measurement equivalence. Each of the measurement equivalence models (on which we elaborate below) explains the variance in and covariance among each performance rating in terms of two factors. The first factor represents variance attributable to commonalities across different rating sources (raters) and is attributable to the performance dimension (trait) being assessed. Factor loadings on these dimension factors represent the direct structural relation between a rating and the underlying performance dimension being rated. If the factor loadings on the same dimensions are equivalent across sources, measurement equivalence with respect to the performance dimension would be indicated.

The second factor represents variance attributable to the rating source irrespective of the performance dimension being measured. Factor loadings on these source factors provide an index of the direct impact of the rating source (e.g., self, supervisor, or peer) on ratings. Here it should be noted that although only 1 individual per rating source was used as a participant in the present study, each of these individuals differed across ratees. That is, different supervisors and peers were used for the different ratees. Thus, the potentially confounding effect of rater level and specific individual rater was eliminated. In other words, source effects represent effects common across different supervisors and/or different peers (i.e., common within-source variance cannot be attributed to a common rater providing ratings for that source level).

Each of the models adds a third component with respect to explaining the variance in performance ratings, that is, variance in the performance ratings that is not common to other ratings, either within dimension or within rating source. In the present case, this variance may be attributable to either unique conceptualizations of the performance dimensions by individual raters or to random unsystematic error effects (however, because there are not multiple raters within each source, it is not possible to separate these effects). Thus, the parameters corresponding to these effects

provide an index of the amount of idiosyncratic variance in performance ratings that is not attributable to either the performance dimension being measured or the rating source. Note that Mount et al. (1998) found that individual raters accounted for more variance in MTMR performance ratings than did rating source groups (e.g., peer, supervisor, self). If this is in fact the case, it is possible that the unique components of the model may account for more variance than either the performance dimensions or rating source. Further, if these parameters are equivalent across sources (as postulated by the parallel model, discussed below), they would indicate that the impact of unique effects on the performance ratings is the same across rating sources.

In the present study, all of the models incorporate 24 performance ratings. These ratings correspond to ratings of eight performance dimensions, each rated by three different rating sources (self, supervisor, and peer). Each of the three measurement equivalence models is based on the general CFA model presented in Figure 1. The three models roughly correspond to the traditional conceptualization of measurement equivalence in terms of congeneric, tau-equivalent, and parallel tests, respectively (Nunnally & Bernstein, 1994) and thus represent sequentially more stringent levels of measurement equivalence. The first model in the sequence (the model actually presented in Figure 1) provides a test of configural invariance (Steenkamp & Baumgartner, 1998). In essence, this model posits an equivalent factor structure (i.e., items load on appropriate performance dimensions) across rating source. This model specifies that (a) ratings of the same dimension (across rating sources) measure a common performance dimension, (b) the loadings of the three ratings from the three different rating sources with respect to the performance dimension are of potentially different magnitudes (i.e., unconstrained), and (c) the unique variances of the ratings are of potentially different magnitudes (i.e., unconstrained).

If the first model is supported, the next model in the sequence provides a test of metric invariance in which factor loadings for like dimensions are constrained to be equal across rating source (Steenkamp & Baumgartner, 1998). The question is whether the relation between specific rating items and the underlying performance dimension tapped by the item is the same across rating source. This model specifies that (a) ratings of the same dimension (across rating sources) measure a common performance dimension, (b) the loadings of the three ratings from the three different rating sources with respect to the performance dimension are of equal magnitudes (i.e., constrained to be equal), and (c) the unique variances of the ratings are of potentially different magnitudes (i.e., unconstrained). A lack of support for this model would indicate that item loadings are not equivalent across groups.

Both configural and metric invariance have been identified as necessary conditions for measurement equivalence (Cheung, 1999; Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000). That is, these tests provide a basis for ascertaining whether measures are conceptually equivalent across groups. Demonstrating conceptual equivalence is necessary for using and comparing results from a measure across groups.
Consequently, if configural or, subsequently, metric invariance is not supported, no substantive between-groups comparisons are warranted because the underlying performance dimensions are defined or understood differently across rating sources. In practical terms, a lack of conceptual equivalence presents an "apples and oranges" dilemma. That is,


any substantive comparisons (including level of agreement) or aggregation of the performance ratings across rating sources would be inappropriate because the ratings would represent different constructs across sources. Similarly, a lack of conceptual agreement would also present major difficulties with respect to any feedback based on the performance dimensions underlying the rating data as the understanding and interpretation of these dimensions would differ depending on the rating source.

If both configural and metric invariance are supported, another form of invariance that may be evaluated includes the extent to which the error variance (variance not attributed to either performance dimension or rating source) associated with the rating items is equivalent across sources. Thus, we evaluate a third model, which specifies that (a) ratings of the same dimension (across rating sources) measure a common performance dimension, (b) the loadings of the three ratings from the three different rating sources are of equal magnitudes with respect to the performance dimension (i.e., constrained to be equal), and (c) the unique variances of the ratings are of equal magnitudes (i.e., constrained to be equal). Unlike configural and metric invariance, however, error variance equivalence is not a necessary condition for meaningful cross-source comparisons. Rather, support for this model would suggest that the performance ratings contain the same amount of error variance across rating sources (i.e., the different sources are equally reliable).

In addition to the models described above, two other models are specified for comparative purposes. First, we evaluate a null model, which postulates that each rating measures a different latent variable (i.e., there is no commonality across the rating sources). Second, we also evaluate a performance-dimension-only model, in which rating source factors are excluded.

Finally, by using the MTMR approach, in addition to assessing whether the performance dimension loadings are equivalent across rating sources, one may also compare the relative magnitudes of the standardized parameter estimates for the different components. This comparison provides an indication of the relative impact of each of the three components. Thus, our goal is not simply to look at levels of measurement equivalence across rating sources but to examine the nature and relative impact of different components on the observed multisource ratings.
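The covariance structure implied by these models can also be sketched numerically. The Python fragment below is our illustration only; the authors estimated the actual models with LISREL 8 (reported in the Results section), and the loading values here are arbitrary placeholders with the source factors held orthogonal to the dimension factors. It builds the 24 × 11 loading matrix for the metric-invariance case and shows how same-source covariances pick up a source component that same-dimension, cross-source covariances do not.

import numpy as np

n_dim, n_src = 8, 3                     # 8 performance dimensions, 3 rating sources
n_items = n_dim * n_src                 # 24 observed ratings; item index i = 3*d + s
lam_dim = np.full(n_dim, 0.6)           # dimension loadings, equal across sources
                                        # (the metric-invariance constraint)
lam_src = np.array([0.50, 0.25, 0.55])  # placeholder source loadings: peer, self, supervisor

# 24 x 11 loading matrix: columns 0-7 are dimension factors, columns 8-10 are source factors.
Lam = np.zeros((n_items, n_dim + n_src))
for d in range(n_dim):
    for s in range(n_src):
        i = 3 * d + s
        Lam[i, d] = lam_dim[d]          # each rating loads on its own performance dimension
        Lam[i, n_dim + s] = lam_src[s]  # ... and on its own rating source

# Factor covariance matrix: dimensions intercorrelate at an illustrative .40;
# source factors are orthogonal to the dimensions and to each other.
Phi = np.eye(n_dim + n_src)
Phi[:n_dim, :n_dim] = 0.4 + 0.6 * np.eye(n_dim)

# Unique variances chosen so every item has unit variance (completely standardized metric).
theta = 1.0 - np.sum((Lam @ Phi) * Lam, axis=1)
Sigma = Lam @ Phi @ Lam.T + np.diag(theta)   # model-implied covariance matrix

# Same dimension, different sources: covariance reflects the dimension factor only.
# Same source, different dimensions: covariance is augmented by the source factor --
# the component that an analysis of separate within-source matrices cannot isolate.
print(round(float(Sigma[0, 1]), 2))   # peer vs. self rating of the same dimension (0.36)
print(round(float(Sigma[0, 3]), 2))   # two peer ratings of different dimensions (0.39)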

Method

Participants
Data for this study were collected as part of the U.S. Air Force Job Performance Measurement (JPM) project (Hedge & Teachout, 1986, 1992). The primary focus of the JPM project was the development of a variety of job performance measures for use in the validation of selection and classification methodologies and the evaluation of training programs. Given this focus, all performance measures, including performance rating measures, were specifically developed for the project and were used solely for research purposes (there were no personnel administrative uses of the ratings). Job performance ratings were obtained from airmen, immediate supervisors of the airmen, and peers and coworkers of the airmen. The sample in the present study consisted of 1,028 airmen from seven different Air Force job categories (aerospace ground equipment mechanic, n = 247; aircrew life support specialist, n = 179; air traffic control operator, n = 176; avionic communications specialist, n = 86; equipment laboratory specialist, n = 135; jet engine mechanic, n = 19; and personnel specialist, n = 186). All airmen were in their first 4-year enlistment term and had between 6 months and 4 years of job experience. Participants rated themselves, and each participant was rated by one supervisor and one peer. We identified a peer rater as a member of the ratee's work group performing the same job under the same conditions (i.e., at the same location, during the same shift, etc.). In addition, all raters received frame-of-reference-based rater training prior to providing ratings.

Figure 1. Multitrait–multirater confirmatory factor analytic model. The eight performance dimensions, technical knowledge (TKN), initiative and effort (IEF), knowledge of and adherence to regulations and orders (KRG), integrity (INT), leadership (LDR), military appearance (MAP), self-development (SDS), and self-control (SCL), are each rated by three different rating sources, self, supervisor (SUPER), and peer, such that self-rating (SE) of TKN is SETKN, SUPER (SU) rating of TKN is SUTKN, peer (PE) rating of TKN is PETKN, and so on.


Rating Instruments
Ratees were rated on the following eight general job performance dimensions: technical knowledge, initiative and effort, knowledge of and


adherence to regulations and orders, integrity, leadership, military appearance, self-development, and self-control. The job performance dimensions were developed to measure factors associated with overall job performance across all Air Force job categories. The job performance dimensions were identified through factor analysis of occupational survey data. In addition, subject matter experts were used to identify the final set of dimensions as well as specific behavioral descriptions for each of the scale anchors (Augustin, Gillet, Guerrero, & Ey, 1989). Previous research examining the psychometric characteristics of the job performance ratings associated with the JPM project has indicated acceptable levels of both reliability and construct validity (Hedge & Teachout, 1992; Kraiger & Teachout, 1990). Ratings were assessed using a single 5-point adjectivally anchored rating scale (1 = never meets acceptable level of proficiency; 5 = always exceeds acceptable level of proficiency) corresponding to each of the performance dimensions.

Results
Means, standard deviations, and intercorrelations among the dimensional ratings are presented in Table 1. Examination of the correlations presented in Table 1 indicates low levels of agreement across rating sources, with a mean within-dimension, cross-source correlation of .28 (mean supervisor–peer r = .33 [SD = .05], mean supervisor–self r = .26 [SD = .05], and mean peer–self r = .24 [SD = .08]). In addition, the mean within-rating-source, cross-dimension correlation (r = .45) indicated a higher level of agreement within (as opposed to across) rating source. This pattern of results is consistent with previous findings of low levels of cross-source agreement.
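These summary values can be recomputed from any 24 × 24 matrix of rating intercorrelations once the rows are labeled by rating source and performance dimension. The short NumPy sketch below is ours (it assumes a particular row ordering and is not the original analysis script):

import numpy as np

sources = ["Peer", "Self", "Superv."]
dims = ["TKN", "IEF", "KRG", "INT", "LDR", "MAP", "SDS", "SCL"]
labels = [(s, d) for d in dims for s in sources]   # assumed row order of the matrix

def agreement_summary(R):
    """R: 24 x 24 correlation matrix ordered as in `labels`."""
    cross_source, within_source = [], []
    for i in range(len(labels)):
        for j in range(i):
            (si, di), (sj, dj) = labels[i], labels[j]
            if di == dj and si != sj:
                cross_source.append(R[i, j])    # same dimension, different sources
            elif si == sj and di != dj:
                within_source.append(R[i, j])   # same source, different dimensions
    return float(np.mean(cross_source)), float(np.mean(within_source))

# The article reports means of roughly .28 (cross-source, within-dimension)
# and .45 (within-source, cross-dimension) for these two sets.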

Model Evaluation
We analyzed each of the models using a confirmatory factor analytic application of LISREL 8 (Jöreskog & Sörbom, 1996). The observed covariance matrix among the 24 performance ratings (8 dimensions × 3 sources) served as input, and maximum likelihood solutions were used for all of the model evaluations. It is important to note that the three measurement equivalence models represent a parameter-nested sequence, in which models are hierarchically nested from most restricted (error variance invariance) to the next level (metric invariance) to the least restricted (configural invariance). Thus, it is possible to use a difference-of-chi-square test to assess whether the difference in fit between each successive model is significant and thereby determine which model offers the best fit to the data. In analyses examining the difference-of-chi-square test among nested models, it is preferable to accept the most restrictive model (i.e., the model with the largest degrees of freedom) that does not result in a significant reduction in fit over less restricted models (Bollen, 1989).

Overall fit indices for each of the model evaluations are presented in Table 2. In addition, examination of the model evaluation output indicated that the analyses converged for each of the models and no out-of-bound parameter estimates were generated. As expected, comparison of the performance-dimension-only model with the MTMR-based performance dimension and rating source factors model indicates substantially better fit for the MTMR-based model. Although the overall chi-square values were significant for both models, examination of the other fit indices indicates that the MTMR-based model provides a high level of fit with the data,

whereas the performance-dimension-only model does not. Specifically, the incremental fit indices (comparative fit index, nonnormed fit index) displayed in Table 2 are above .99 for the MTMR model and below .60 for the performance-dimension-only model. Moreover, the root-mean-square error of approximation value for the MTMR model was .017, significantly below .050, compared with a value of .185 for the performance-dimension-only model. These results indicate that the performance ratings were influenced by both the underlying performance dimension and the rating source. Thus, the MTMR model is the appropriate model from which to examine measurement equivalence.

We next evaluated the three measurement equivalence models. In general, all three models provided a high level of fit with the data. Relative comparison of the three models (also presented in Table 2) indicated the lowest chi-square value (and thus best fit) for the configural invariance model. However, the difference in chi-square values between this model and the metric invariance model was not significant, Δχ²(213 − 197 = 16, N = 1,028) = 271.26 − 254.88 = 16.38, p = .43. The difference in chi-square values between the metric invariance and error variance invariance models, however, was significant, Δχ²(229 − 213 = 16, N = 1,028) = 325.64 − 271.26 = 54.38, p < .01. This indicates that the error variance invariance model represents a significantly lower level of fit to the data, and thus, overall, the metric invariance model provides the best representation of the present data.

The individual standardized parameter estimates for the metric invariance model are presented in Table 3. Examination of these estimates reveals that factor loadings for the performance dimensions (presented in the second column of Table 3) ranged from 0.47 to 0.71 with a mean value of 0.58. All of the t values for the performance loadings were significant. The standardized parameter estimates associated with the rating sources are also presented in Table 3. Examination of the parameters indicates that the factor loadings for the rating sources ranged from 0.42 to 0.59 for peers, 0.19 to 0.33 for self, and 0.47 to 0.63 for supervisors, with means of 0.51, 0.25, and 0.55, respectively. Parameters associated with unique variance components (presented in the last column of Table 3) were all relatively high, ranging from 0.33 to 0.57 with a mean value of 0.44.

Finally, Table 4 presents the proportion of variance accounted for by each source (dimension, rating source, and uniqueness) in each observed rating. Overall, these results indicated that the uniqueness component accounted for the largest share of the rating variance across rating source and dimensions (on average, approximately 45% of the variance). Performance dimension, on average, accounted for approximately 34% of the variance, and rating source accounted for approximately 21%.

In summary, our results suggest that (a) the extent to which ratings reflect the underlying performance dimension is equivalent across the different rating sources, (b) the extent to which ratings are directly influenced by rating source is substantial and generally only slightly smaller than effects attributable to the performance dimensions, (c) the extent to which ratings are directly influenced by rating source differs widely depending on the source, and (d) the largest single influence on ratings is attributable to parameters not associated with either the performance dimensions or rating source.
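The nested-model comparisons above follow the standard chi-square difference test and can be checked directly; a brief sketch using SciPy (ours, not the authors' LISREL output):

from scipy.stats import chi2

def delta_chi2_test(chi2_restricted, df_restricted, chi2_free, df_free):
    """Chi-square difference test for a pair of nested models."""
    d_chi2 = chi2_restricted - chi2_free
    d_df = df_restricted - df_free
    return d_chi2, d_df, chi2.sf(d_chi2, d_df)   # survival function gives the p value

# Metric vs. configural invariance: 271.26 - 254.88 = 16.38 on 16 df, p ~ .43 (ns).
print(delta_chi2_test(271.26, 213, 254.88, 197))
# Error variance vs. metric invariance: 325.64 - 271.26 = 54.38 on 16 df, p < .01.
print(delta_chi2_test(325.64, 229, 271.26, 213))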

Table 1
Means, Standard Deviations, and Intercorrelations for the Performance Ratings

    Rating source   Performance dimension    M      SD
1.  Peer            TKN                      3.73   0.85
2.  Self            TKN                      3.86   0.77
3.  Superv.         TKN                      3.62   0.90
4.  Peer            IEF                      3.44   0.93
5.  Self            IEF                      3.71   0.77
6.  Superv.         IEF                      3.39   1.00
7.  Peer            KRG                      3.56   0.89
8.  Self            KRG                      3.70   0.80
9.  Superv.         KRG                      3.46   0.92
10. Peer            INT                      3.73   0.93
11. Self            INT                      4.00   0.78
12. Superv.         INT                      3.76   1.00
13. Peer            LDR                      3.40   0.92
14. Self            LDR                      3.86   0.73
15. Superv.         LDR                      3.31   0.92
16. Peer            MAP                      3.69   0.92
17. Self            MAP                      3.76   0.81
18. Superv.         MAP                      3.47   0.86
19. Peer            SDS                      3.54   0.80
20. Self            SDS                      3.71   0.73
21. Superv.         SDS                      3.47   0.86
22. Peer            SCL                      3.83   0.95
23. Self            SCL                      4.11   0.77
24. Superv.         SCL                      3.80   0.94

Note. TKN = technical knowledge; Superv. = supervisor; IEF = initiative/effort; KRG = knowledge of regulations; INT = integrity; LDR = leadership; MAP = military appearance; SDS = self-development; SCL = self-control.



Table 2
Goodness-of-Fit Statistics for the Performance Rating Factor Models

Model                                         χ²(N = 1,028)    df    Δχ²(N = 1,028)ᵃ   Δdf    RMSEA    ECVI    CFI    NNFI
Null                                          10,502.76        276                            .190     10.27
Perf. dimension factors only                   8,055.35        224    2,447.41          52    .185      7.99   .595   .501
Perf. dimension and rating source factors
  Error variance invariance                      325.64        229      101.77          47    .020*     .457   .991   .991
  Metric invariance                              271.26        213       54.38          16    .017*     .436   .995   .995
  Configural invariance                          254.88        197       16.38          16    .017*     .448   .997   .997

Note. χ²crit(16, N = 1,028) = 26.30, p = .05. RMSEA = root-mean-square error of approximation; ECVI = expected cross-validation index; CFI = comparative fit index; NNFI = nonnormed fit index; Perf. = performance.
ᵃ The difference in χ² is relative to the next more restricted model (e.g., the configural invariance model is compared with the metric invariance model, the metric invariance model with the error variance invariance model, the error variance invariance model with the null model, and the perf. dimension factors only model with the null model).
* p < .05.

Discussion
Table 3
Standardized Parameter Estimates for the Factor Loading and Unique Variance Parameters Derived From the Metric Invariance CFA Model

                              Parameter loadings
Performance dimension   Dimensionᵃ   Peerᵃ   Selfᵃ   Superv.ᵃ   Unique varianceᵇ
Peer TKN                0.65         0.43                        0.39
Self TKN                0.71                 0.23                0.45
Superv. TKN             0.60                         0.51        0.38
Peer IEF                0.55         0.55                        0.40
Self IEF                0.61                 0.26                0.48
Superv. IEF             0.52                         0.51        0.35
Peer KRG                0.55         0.53                        0.42
Self KRG                0.61                 0.23                0.57
Superv. KRG             0.52                         0.62        0.33
Peer INT                0.50         0.59                        0.40
Self INT                0.59                 0.33                0.54
Superv. INT             0.47                         0.63        0.38
Peer LDR                0.52         0.56                        0.42
Self LDR                0.64                 0.23                0.53
Superv. LDR             0.51                         0.58        0.39
Peer MAP                0.61         0.42                        0.45
Self MAP                0.69                 0.28                0.45
Superv. MAP             0.59                         0.47        0.44
Peer SDS                0.56         0.48                        0.46
Self SDS                0.61                 0.23                0.57
Superv. SDS             0.52                         0.58        0.39
Peer SCL                0.54         0.52                        0.44
Self SCL                0.66                 0.19                0.53
Superv. SCL             0.54                         0.53        0.43
M                       0.58         0.51    0.25    0.55        0.44

Note. CFA = confirmatory factor analysis; Superv. = supervisor; TKN = technical knowledge; IEF = initiative/effort; KRG = knowledge of regulations; INT = integrity; LDR = leadership; MAP = military appearance; SDS = self-development; SCL = self-control.
ᵃ LISREL lambda x (LX) completely standardized parameter values. LX values within performance dimensions were constrained to be equal. This equality constraint is reflected in the unstandardized parameter estimates.
ᵇ LISREL diagonal theta delta completely standardized parameter values.

In the present study we extend the literature on cross-source measurement equivalence by (a) comparing the fit of a model of multisource performance ratings that includes performance-dimension factors only with an MTMR-based model in which ratings are a function of both performance dimension and rating-source factors and (b) evaluating the degree of measurement equivalence across rating source within the context of the MTMR-based model.

Our results represent a good news, bad news scenario for organizations that use multiple rating source feedback systems. First, the good news is that different rating sources are, at least to some extent, equivalent with respect to the performance constructs being rated. This suggests that it is possible to make meaningful comparisons across sources and that, with enough raters, it would be possible to converge on a reliable measurement of the performance dimension of interest. In addition, the effects associated with the performance dimensions of interest are generally larger than those associated with source-specific effects. In more general MTMM terms, trait (dimension) effects are generally larger than method (source) effects. This suggests that lack of agreement across rating sources is less the result of source-specific effects and more likely the result of unique effects associated with the construct being rated. It appears that different rating sources are able to assess commonalities in performance to an equivalent extent and to an extent greater than that associated with source-specific effects.

The bad news is that the impact of the performance dimension on ratings pales in relation to other sources of variance. Two thirds of the variance in our performance-rating data are attributable to the combination of source-specific effects and other unique effects. An important unanswered question has to do with the nature of the unique variance in the ratings. Two factors may have contributed to these effects. First, these effects may reflect random error variance in the ratings and thus provide an indication of reliability (or lack thereof). Given that each of the dimensions was assessed with a single rating item, there is likely a good deal of random measurement error. These effects may also reflect systematic variance not attributable to either commonalities in the performance dimension or to the rating source.


Table 4
Proportion of Variance Attributable to Performance Dimension, Rater Source, and Uniqueness Derived From the Metric Invariance CFA Model

                              Variance source
Performance dimension   Dimensionᵃ   Peerᵃ   Selfᵃ   Superv.ᵃ   Uniqueness
Peer rating
  TKN                   .43          .19                        .39
  IEF                   .30          .30                        .40
  KRG                   .30          .28                        .42
  INT                   .25          .35                        .40
  LDR                   .27          .31                        .42
  MAP                   .37          .18                        .45
  SDS                   .31          .23                        .46
  SCL                   .29          .27                        .44
  M                     .32          .26                        .42
Self-rating
  TKN                   .50                  .05                .45
  IEF                   .37                  .07                .48
  KRG                   .37                  .05                .57
  INT                   .35                  .11                .54
  LDR                   .41                  .05                .53
  MAP                   .48                  .08                .45
  SDS                   .37                  .05                .57
  SCL                   .44                  .04                .53
  M                     .41                  .06                .52
Superv. rating
  TKN                   .36                          .26        .38
  IEF                   .27                          .26        .35
  KRG                   .27                          .38        .33
  INT                   .22                          .40        .38
  LDR                   .26                          .34        .39
  MAP                   .35                          .22        .44
  SDS                   .27                          .34        .39
  SCL                   .29                          .28        .43
  M                     .29                          .31        .40

Note. CFA = confirmatory factor analysis; Superv. = supervisor; TKN = technical knowledge; IEF = initiative/effort; KRG = knowledge of regulations; INT = integrity; LDR = leadership; MAP = military appearance; SDS = self-development; SCL = self-control.
ᵃ Proportion of variance accounted for by performance dimensions and rating source is represented as the squared parameter loadings based on the tau-equivalent model.
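As the table note indicates, these proportions are the squared completely standardized loadings from Table 3 (the standardized unique variances are already proportions). A quick, approximate check of the averages reported in the text, using the Table 3 column means rather than the item-level estimates (our arithmetic, so rounding differs slightly):

mean_dim_loading = 0.58                                   # Table 3, dimension column mean
mean_src_loading = {"peer": 0.51, "self": 0.25, "superv": 0.55}
mean_unique_variance = 0.44                               # mean theta-delta value from Table 3

dim_share = mean_dim_loading ** 2                         # about .34
src_share = sum(v ** 2 for v in mean_src_loading.values()) / len(mean_src_loading)   # about .21
print(round(dim_share, 2), round(src_share, 2), mean_unique_variance)
# Roughly 34%, 21%, and 44-45% of rating variance: the ordering reported in the text and in Table 4.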

This systematic variance may reflect the unique aspects of performance observed and rated by each rating source. For example, all raters may have had the opportunity to observe some aspects of performance. They may also have had the opportunity to observe unique aspects of performance. This would be reflected in the pattern of results obtained. Of course it is most likely that the unique effects represent both systematic and random error components.

To the extent that these effects are in fact attributable to the opportunities of the different rating sources to observe and evaluate unique aspects of performance associated with each dimension, it would greatly support the value of incorporating multiple perspectives in the feedback system. That is, the different rating

sources together provide a more complete picture of the target ratee's performance. From this perspective, lack of agreement across sources would not be considered negative and should be expected. If, however, the unique effects associated with the construct being measured are simply the result of random error, it would suggest that unreliability in ratings is the major contributor to lack of agreement across sources. This too, however, would still support the value of incorporating multiple rating sources. That is, the use of multiple raters should provide a more reliable measure of performance than would a single rater. Certainly, an important issue for future research is the disentanglement of these effects. One suggestion for how this issue might be addressed is to use confirmatory factor analytic techniques to look at the relationship of these unique effects with other criterion measures (e.g., work samples, job knowledge tests). To the extent that these effects reflect systematic performance-related variance, they should correlate with other criterion measures, and to the extent that they represent random effects they should not. This approach might also be extended to the latent variables representing the performance dimensions as well. That is, correlations of these latent variables with other criterion (or predictor) measures might also be examined as an indication of which of the dimensions might be most related to other aspects of performance.

As with any study, it is important to recognize potential limitations. A potential source of concern is that the present study is based on a single sample. We believe that several factors mitigate this limitation. First, albeit singular, the present sample was quite large and thus allows for a more methodologically optimal evaluation of the models, especially given the sample sizes in other MTMR studies. Second, and more important, the present sample actually combines data across seven different jobs ranging from personnel specialists to jet engine mechanics and thus incorporates a relatively diverse set of occupations.

Another potential limitation is that the ratings in the present study were collected for research purposes and were not used in any administrative context. On the one hand, this may limit the generalizability of findings to ratings that may be used for administrative purposes. It is possible that ratings collected for administrative purposes may be more a function of rating source or unique effects. On the other hand, we believe that the majority of multisource performance ratings are collected for feedback and development purposes as opposed to administrative decision making. Regardless, it is likely that our rating data represent an optimal picture of the impact of the underlying performance dimensions.

Finally, another potential limitation is that all of our raters received frame-of-reference training with respect to the dimensions of interest. This may serve to increase the overall impact of the performance dimensions on the ratings (as well as increase synchronization across sources) and thus limit the generalizability of our results with respect to multisource rating systems in which raters do not receive such training. Here we would like to note, however, that rater training should be incorporated into any performance-rating system. Again, our results likely represent an optimal picture of the impact of the underlying performance dimensions.
In summary, the present study represents an attempt to empirically model the relations among ratings of multiple performance dimensions by multiple rating sources. We demonstrated that the traditional linear CFA model provides the best representation of the MTMR data analyzed.



The present study also attempts to address the measurement equivalence of multiple rating sources and suggests that different rating sources are, at least to some extent, equivalent with respect to the performance constructs being rated. This is important because if the constructs being rated are not on the same psychological measurement scale, observed differences across rating sources would be difficult or impossible to interpret. We believe that the present study sheds some light on the nature of the convergence, or lack thereof, of ratings across sources. Future research is needed to further elucidate this issue.

References
Ashford, S. J. (1989). Self-assessment in organizations: A literature review and integrative model. In L. L. Cummings & B. M. Staw (Eds.), Research in organizational behavior (Vol. 11, pp. 133–174). Greenwich, CT: JAI Press.
Augustin, J. W., Gillet, A. H., Guerrero, C., Jr., & Ey, A. W. (1989). Documentation of the job performance measurement system database contents (UES Rep. No. 788-038). Dayton, OH: Universal Energy Systems.
Bollen, K. (1989). Structural equations with latent variables. New York: Wiley.
Bracken, D. W., Timmreck, C. W., & Church, A. H. (2001). The handbook of multisource feedback. San Francisco, CA: Jossey-Bass.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105.
Cheung, G. W. (1999). Multifaceted conceptions of self–other ratings disagreement. Personnel Psychology, 52, 1–36.
Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1–27.
Church, A. H., & Bracken, D. W. (1997). Advancing the state of the art of 360-degree feedback. Group and Organization Management, 22, 149–161.
Conway, J. M. (1996). Analysis and design of multitrait–multirater performance appraisal studies. Journal of Management, 22, 139–162.
Conway, J. M., & Huffcutt, A. I. (1997). Psychometric properties of multisource ratings: A meta-analysis of subordinate, supervisor, peer, and self-ratings. Human Performance, 10, 331–360.
Cooper, W. H. (1981). Ubiquitous halo. Psychological Bulletin, 90, 218–244.
Coovert, M. D., Craiger, J. P., & Teachout, M. S. (1997). Effectiveness of the direct product versus confirmatory factor model for reflecting the structure of multimethod–multirater job performance data. Journal of Applied Psychology, 82, 271–280.
Facteau, J. D., & Craig, S. B. (2001). Are performance appraisal ratings from different rating sources comparable? Journal of Applied Psychology, 86, 215–227.
Goffin, R. D., & Jackson, D. N. (1992). Analysis of multitrait–multirater performance appraisal data: Composite direct product method versus confirmatory factor analysis. Multivariate Behavioral Research, 27, 363–385.
Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self–supervisor, self–peer, and peer–supervisor ratings. Personnel Psychology, 41, 43–62.
Hedge, J. W., & Teachout, M. S. (1986). Job performance measurement: A systematic program of research and development (Rep. No. AFHRL-TP-86-87). Brooks Air Force Base, TX: Air Force Human Resources Laboratory, Training Systems Division.
Hedge, J. W., & Teachout, M. S. (1992). An interview approach to work sample criterion measurement. Journal of Applied Psychology, 77, 453–461.
Jöreskog, K. G., & Sörbom, D. (1996). LISREL 8: User's reference guide [Computer software]. Chicago: Scientific Software International.
Kenny, D. A., & Kashy, D. A. (1992). Analysis of the multitrait–multimethod matrix by confirmatory factor analysis. Psychological Bulletin, 112, 165–172.
Kraiger, K., & Teachout, M. S. (1990). Generalizability theory as construct-related evidence of the validity of job performance ratings. Human Performance, 2, 249–272.
Lance, C. E., Baxter, D., & Mahan, R. P. (in press). Evaluation of alternative perspectives on source effects in multi-source performance measures. In W. Bennett Jr., C. E. Lance, & D. J. Woehr (Eds.), Performance measurement: Current perspectives and future challenges. Mahwah, NJ: Erlbaum.
Lance, C. E., & Bennett, W., Jr. (1997, April). Rater source differences in cognitive representations of performance dimensions. Paper presented at the annual meeting of the Society for Industrial and Organizational Psychology, St. Louis, MO.
Lance, C. E., Teachout, M. S., & Donnelly, T. M. (1992). Specification of the criterion construct space: An application of hierarchical confirmatory factor analysis. Journal of Applied Psychology, 77, 437–452.
Lance, C. E., & Woehr, D. J. (1986). Statistical control of halo: A clarification from two models of the performance appraisal process. Journal of Applied Psychology, 71, 679–685.
Lance, C. E., & Woehr, D. J. (1989). The validity of performance judgments: Normative accuracy model versus ecological perspectives. In D. F. Ray (Ed.), Southern Management Association Proceedings (pp. 115–117). Oxford, MS: Southern Management Association.
Mabe, P. A., III, & West, S. G. (1982). Validity of self-evaluation of ability: A review and meta-analysis. Journal of Applied Psychology, 67, 280–296.
Marsh, H. W. (1989). Confirmatory factor analysis of multitrait–multimethod data: Many problems and a few solutions. Applied Psychological Measurement, 13, 335–361.
Marsh, H. W., & Bailey, M. (1991). Confirmatory factor analysis of multitrait–multimethod data: A comparison of alternative models. Applied Psychological Measurement, 15, 47–70.
Maurer, T., Raju, N. S., & Collins, W. C. (1998). Peer and subordinate performance appraisal measurement equivalence. Journal of Applied Psychology, 83, 693–702.
Mount, M. K., Judge, T. A., Scullen, S. E., Sytsma, M. R., & Hezlett, S. A. (1998). Trait, rater, and level effects in 360-degree performance ratings. Personnel Psychology, 51, 557–576.
Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational, and goal based perspectives. Thousand Oaks, CA: Sage.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Steenkamp, J. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–90.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–69.
Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81, 557–574.

Received November 18, 2002
Revision received December 17, 2003
Accepted January 9, 2004
