www.elsevier.com/locate/aos
Abstract
We examine judgmental effects of the balanced scorecard’s organization. The balanced scorecard contains a large
number of performance measures divided into four categories. We examine whether the scorecard’s organization
results in managerial performance evaluation judgments consistent with a recognition of the potential relations (i.e.
nonindependence) of measures within a category. Supporting this idea, we find that performance evaluations are
affected by organizing the measures into the balanced scorecard categories when multiple below-target (or above-tar-
get) measures are contained within a category but that evaluations are not affected when the above/below-target mea-
sures are distributed across the scorecard’s four categories. # 2002 Elsevier Science Ltd. All rights reserved.
category are used to make an assessment of the necessary for implementing a BSC. The BSC,
category and these four assessments are then com- according to Kaplan and Norton, should contain
bined. In assessing each category, decision makers measures related to financial performance (e.g.
are primed to see relations among the measures return on assets), customer relations (e.g. custo-
within each group (Hopkins, 1996). When perfor- mer satisfaction surveys), internal business pro-
mance on measures within a group is consistent cesses (e.g. process efficiency measures), and
(e.g. consistently above-target), the decision maker learning and growth in the organization (e.g.
may perceive that the measures are related (i.e. not employee capability measures). Kaplan and Nor-
independent) and consequently, reduce the impact ton (1993, 1996b) view the scorecard as a strategic
of the individual measures on his or her judgment. management tool that should explicate the drivers
In contrast, when the same measures are presented of performance, as well as provide measures of
without the organizing BSC categories (or are performance. This study focuses on the score-
scattered across BSC categories), the perception of card’s use in evaluation and decision making.
relations among these measures and the resulting
reduction in decision weights are less likely. 2.2. Cognitive limitations and the divide and
Our results show that when multiple measures conquer decision strategy
within a BSC category show consistent perfor-
mance (e.g. above-target), managers’ evaluation The balanced scorecard with its large number of
judgments are reliably different from evaluations performance measures presents a complex task to
made using these same measures without the BSC a manager asked to use the scorecard to evaluate a
format. These judgment differences disappear division’s performance. The manager could, theo-
when the measures indicating strong performance retically, weight and combine the many measures
are distributed throughout the four BSC cate- into an overall evaluation of the business unit but
gories instead of being found in a single BSC this is, cognitively, a very difficult thing to do.
category. Although it is difficult to state with cer- Research in cognitive psychology has repeatedly
tainty that the BSC results in judgment improve- shown that humans are able to retain and use only
ments, this study provides evidence that the BSC a small number of items in working memory
has predictable and understandable effects on (Baddeley, 1994; Miller, 1956). With this limit on
judgment. While these grouping effects may occur working memory, holding 20 or more individual
with other types of categorizations, other group- measures in one’s head and mentally manipulating
ings have not received the same kind of attention them simultaneously is extremely difficult, if not
as those in the BSC. impossible. Thus, the volume of data in a balanced
The remainder of the paper is organized as fol- scorecard suggests that it may overload human
lows. In the next section we will briefly describe decision makers with information.
the BSC, review applicable judgment and decision The balanced scorecard’s four categories suggest
making research, and present a two-part research a way for managers to mentally organize the large
prediction. Section three describes the experi- number of performance measures that may miti-
mental work used to test our research predictions gate this cognitive difficulty. Prior studies show
and the final section summarizes the conclusions that information processing and judgments are
that can be drawn from the study. affected by information organization (Bettman &
Kakkar, 1977; Payne, Bettman, & Johnson, 1993)
and by the hierarchies or relations among infor-
2. Background mation items contained in a decision task (Klein-
muntz & Schkade, 1993). For example, Hopkins
2.1. The balanced scorecard (1996) showed that placing an item (e.g. preferred
stock) in a particular category (e.g. liabilities)
In a best-selling book Kaplan and Norton caused experienced professionals to perceive that
(1996a) describe the methods and procedures the item was related to others in the category.
M.G. Lipe, S. Salterio / Accounting, Organizations and Society 27 (2002) 531–540 533
These studies suggest that when data items are the above discussion suggests that judgments made
grouped in ways meaningful to the decision with the BSC will not differ from those made with
maker, they may be combined prior to further use uncategorized lists of measures in situations where
(Chase & Simon, 1973). Shanteau (1988) describes multiple above-target (or below-target) measures
this method of using information as ‘divide and are scattered across the BSC categories.
conquer.’ The information is divided into groups, Although performance results on the twenty or
an assessment can be made of each group, and more performance measures in a BSC may take on
these assessments can then be combined. The any number of patterns, we will test the impact of
organization of the BSC lends itself quite naturally only the two extreme patterns described above.
to this kind of mental approach. That is, we will consider a situation where multiple
above-target (or below-target) measures are con-
2.3. Perceived relations among measures tained within one BSC category and then we will
contrast that with the situation where the above-
When using the BSC, the initial stage of the target (below-target) measures are distributed
divide and conquer decision strategy is to use across categories. For these two situations, we will
measures within a category to assess performance compare the judgments for decision makers with
in that area (e.g. financial performance). Since the the BSC to those of decision makers using the
measures have been grouped together, the decision same measures without the BSC categories. Our
maker will be expecting and seeking relations research predictions are:
between them (Hopkins, 1996; Maines & McDan-
iel, 2000). If performance on these measures Evaluations using the balanced scorecard will
confirms this expectation (e.g. by indicating differ from evaluations based on the same mea-
consistently good performance), the decision sures without the scorecard organization,
maker may reasonably reduce the decision weight depending on the pattern of performance across
placed on each individual measure due to per- categories. Specifically:
ceived correlations (nonindependence) of the
measures (Banker & Datar, 1989; Feltham & Xie, 1. judgments are likely to be moderated when
1994). In contrast, if measures indicating good multiple above-target (or below-target)
performance are scattered across BSC categories measures are contained in a single BSC
(or contained in uncategorized lists of measures), category but,
the decision maker is less likely to expect and per- 2. judgments are unlikely to be affected when
ceive these measures to be correlated and to make multiple above-target (or below-target)
consequent reductions to their decision weights. measures are distributed throughout the
This is consistent with findings in psychology that BSC categories.
people find it difficult to recognize that correla-
tions exist unless they have theories suggesting The next section describes the experiments and the
such relations (Jennings et al., 1982) and with test results.
Maines’ (1990) findings in accounting that judg-
mental discounting for information redundancy
(i.e. correlation) does not occur unless the judge is 3. Method and results
alerted to the presence of such relations (see,
especially, her experiment three). 3.1. Overview of experiments
This suggests that judgments made with the
BSC will differ from those made with uncategor- Participants are presented with a case where
ized lists of measures in particular situations: they are asked to take the role of a senior executive
those cases where performance on measures within of WCS Incorporated, a firm specializing in retail-
a category are consistent (i.e. consistently above- ing women’s apparel. WCS has multiple divisions,
target or consistently below-target). Additionally, the two largest of which are the focus of the case
534 M.G. Lipe, S. Salterio / Accounting, Organizations and Society 27 (2002) 531–540
materials. The case introduces the managers of the priate for retailers and capture the two different
two business units and the strategies of the units strategies.
are described. Multiple performance measures are
presented in patterns and formats depending on 3.2. Experiment one
the experimental treatment as described below.
The participant is then asked to evaluate the per- In experiment one we focus on whether the BSC
formance of each of the two unit managers on a format makes a difference in divisional manager
scale with seven descriptive labels and numerical performance evaluation when particularly good or
endpoints of ‘‘0’’ and ‘‘100’’ (see Table 1 for a bad performance is contained in one BSC cate-
sample evaluation form). gory. In this situation, we predict that the BSC
After providing the manager evaluations, the categorization primes the evaluator to perceive
participants complete a questionnaire. This consistent performance as evidence of correlation
questionnaire asks for demographic information, among measures, which may reduce the impact of
provides manipulation checks (discussed further the individual good or bad measures. This per-
in Sections 3.2.3 and 3.3.2), and gathers data ceived correlation will moderate judgments rela-
regarding task difficulty, realism, and under- tive to those made without the BSC organization.
standability.
In both experiments the two divisions described 3.2.1. Subjects, design, and procedures
are RadWear and PlusWear, retail divisions spe- Seventy-eight MBA students served as experi-
cializing in clothing for the urban teenager and in mental participants. The students had, on average,
large-sized clothing, respectively. The participants 4 years of work experience and 62% were male.
are informed that management believes the per- All participants received a diverse set of per-
formance measures for each division are appro- formance measures, a description of how the
Table 1
Sample evaluation form employed in both experiments
WCS Inc.
Initial Evaluation Form
Year: 1996
Manager: Chris Peters
Division: RadWear
Evaluator:
1. Indicate your initial performance evaluation for this manager by placing an ‘X’ somewhere on the scale below. Note that some label
interpretations are provided below.
Table 2
RadWear balanced scorecarda (PlusWear items in parentheses)
Financial
1. Return on sales 24% (22) 25% (23)
2. Sales growth 35% (30) 38% (33)
3. New store sales (new lines sales) 30% (25) 26% (22)
4. Market share relative to retail space $80 (70) $80 (70)
5. Return on expenses 42% (36) 42% (36)
Customer-related
1. Repeat sales 30% (40) 33% (36)
2. Customer satisfaction rating 95 (97) 96 (96)
3. Mystery shopper program rating 96 (96) 98 (94)
4. Returns by customers as % of sales 10% (7) 9% (8)
5. Out of stock items 10% (14) 10% (14)
measures were calculated, and the comparison of sures, and learning measures) while other
each measure to its expectation or target for each participants received the same set of 20 measures
of the two divisions (see Table 2 for the BSC ver- without the BSC format (NOFORM group). For
sion of the task).1 Further, all participants were the NOFORM group the measures were presented
told that the performance measures were ‘‘care- in one of two orders, alphabetical or random.2 In
fully chosen to represent important aspects of a addition to the format manipulation across sub-
business unit[’s performance]’’ and were ‘‘drivers ject groups, the order of presentation of the two
of the unit’s success and linked to its strategy and divisions (i.e. RadWear and PlusWear) was coun-
mission.’’ terbalanced across subjects within each format
The between-subjects (Ss) manipulation was the group.
organization of the performance measures. The For all participants, the financial measures indi-
BSC group received the 20 measures divided into cated that performance was somewhat above
the four BSC categories (financial measures, cus- expectations for both divisions (note in Table 2
tomer satisfaction measures, operational mea- that two financial measures were above-targets,
2
The order of measures for the latter was chosen by ran-
1
Participants received separate exhibits for RadWear and dom draw with the only proviso that adjacent measures should
PlusWear (and none of their measures were shown in bold). not come from the same BSC category. Two orders were used
Measures for both divisions are included in Table 2 for effi- for the NOFORM group to increase the generalizability of
ciency of exposition. results.
536 M.G. Lipe, S. Salterio / Accounting, Organizations and Society 27 (2002) 531–540
Table 4
Descriptive statistics for experiment one manager evaluations—means (standard deviations)
alphabetic versus the random order for any of these 24.29 points higher than PlusWear’s. When the
questions (all P-values> 0.10) or for the manage- multiple positive/negative measures all related to
rial evaluations. Also, the order of the presentation customer relations, they led to less extreme eval-
of divisions had no effects on responses to the uations when they were displayed in this BSC
manipulation check questions (all P-values> 0.10). category as opposed to being distributed through-
Although division order was not related to the out the unorganized list.
hypotheses, it did interact with division in affect-
ing performance evaluations (F=10.57, P < 0.01). 3.2.4. Supplemental analysis
Thus, division order is included in the statistical Since our research predictions were based on the
analysis but it is not discussed further.6 idea that subjects with BSC categories would use a
Analyzing the individual manager evaluations divide and conquer strategy, we expect to see dif-
via a repeated measures 222 analysis of var- ferences in the patterns of data processing for BSC
iance (ANOVA) with scorecard organization and versus NOFORM subjects. To test this, 64 of the
division order as between-Ss factors and division subjects in experiment one provided memos
as a within-Ss factor (see Table 3), indicates sta- explaining each of their managerial evaluations.
tistically significant effects for division (F=97.14, The NOFORM subjects mentioned, on average,
P < 0.01) and the interaction of division and orga- 22.6 individual performance measures in these
nization (F=5.70, P< 0.02). These results show memos. They also referred to self-generated
that RadWear’s manager is evaluated higher than groups of measures 1.1 times, on average, with the
PlusWear’s (means [standard deviations] of 71.76 most common group of measures mentioned being
[12.76] and 51.42 [17.34], respectively) and that the a self-generated customer-related grouping. In
scorecard’s organization affects the relative eval- total, about 95% of the items mentioned by these
uations of the two divisional managers. subjects were individual measures. The BSC sub-
The interaction of organization and division jects referred to an average of 18.7 individual
supports our first research prediction, showing measures. They also mentioned 8.1 groups of
that information’s organization affects the relative measures (including multiple mentions of some
evaluations of the managers for this pattern of groups) and most of these (98%) were the BSC
performance results where particularly positive/ categories. The most common group mentioned
negative performance is concentrated in one BSC was the customer-related BSC category. The pat-
category. As shown in Table 4, participants with terns of usage of group versus individual measures
the four category organization of measures eval- are significantly different for the BSC and
uated RadWear’s manager 17.27 points higher NOFORM subjects [w2(1)=249.16, P < 0.01].
than PlusWear’s while participants with the Logically, the BSC organization led to increased
NOFORM format evaluated RadWear’s manager consideration of groups of measures. Thinking
about the measures in these groups led to moder-
6
ated judgments when measures with above-target
Specifically, when PlusWear was rated first, the average
evaluations were 67.97 and 54.97 for RadWear and PlusWear, (or below-target) performance were contained
respectively. When RadWear was rated first these average eva- within a single BSC category and may have been
luations were 74.39 and 48.96. correlated (nonindependent).
538 M.G. Lipe, S. Salterio / Accounting, Organizations and Society 27 (2002) 531–540
3.3. Experiment two divisions beat their targets for one (non-DIFF)
measure and met their targets on the other three.7
In experiment two we focus on whether the BSC The DIFF measures were in positions 1, 7, 13, and
format makes a difference in divisional manager 19 (out of 20) in the BSC format and 2, 4, 12, and
performance evaluation when particularly positive 19 in the NOFORM format. This NOFORM
(or negative) performance measures are dis- positioning is the same as those used in the ran-
tributed across all four BSC categories. Experi- dom listing in experiment one.
ment one shows that the BSC organization can
impact judgments given a particular pattern of 3.3.2. Results
performance results (i.e. where consistently posi- As in experiment one, subjects with the BSC
tive or negative performance is found in one cate- format judged the performance measures to be
gory). In contrast, our second research prediction more logically organized and usefully categorized
posits that the BSC’s organization will not affect than those with the NOFORM format (both
judgments when the multiple positive/negative P < 0.01). The 222 repeated measures ANOVA
measures are distributed across categories (so that indicates only two statistically significant effects:
the categorization does not suggest the DIFFerent Division (F=24.30, df=1.67, P< 0.01) with Rad-
measures are related). Wear evaluated higher than PlusWear (means
[standard deviations] of 69.49 [11.97] and 61.44
3.3.1. Subjects, design, and procedures [14.36], respectively) and Order (F=3.91, df=1,67,
Seventy-one students in graduate-level manage- P=0.05).8 The format of the performance mea-
rial accounting courses participated as subjects. sures did not affect the judgments (F=0.20), nor
The students had average work experience of 5 did it interact with division (F=0.48). Thus, in
years and 58% were male. The dependent measure contrast to results for experiment one, with this
for experiment two is the same as that of experi- pattern of performance results (i.e. with the above/
ment one, the managerial evaluations. below-target measures distributed across BSC
Similarly to experiment one, a 222 repeated categories), the BSC format did not affect the
measures design was used. Performance measure evaluations of the managers.
organization was varied between Ss; students The experiments indicate that, dependent on the
either saw the measures in the BSC format or the pattern of performance results, organizing mea-
NOFORM (random only) format. The order of sures into the BSC can affect managerial judg-
presentation of RadWear and PlusWear was again ments. This may be caused by information
varied between Ss. The within-Ss factor was divi-
7
sional performance on four DIFF measures, with Although it would have been ideal to use the same DIFF
RadWear superior to PlusWear on all four. In measures in experiments one and two, it was not possible to do
this while credibly placing these measures all into one BSC
contrast to experiment one, here these DIFF category in one while distributing them across categories in the
measures were distributed across the BSC cate- other. Instead, experimental control was maintained by holding
gories. They were Return on Sales, Mystery many other things constant across the two experiments. For
Shopper program rating, Average major brand example, in both experiments RadWear (PlusWear) beat its
names/store (or Average% of product range), and target on eight (four) measures, missed it on three (seven), and
tied it on nine (nine). Summing up the percentage above or
Hours of employee training/employee. For all
below target on each measure, RadWear (PlusWear) beat its
other measures, RadWear and PlusWear per- targets by a sum of 12.5% ( 30.31%) in experiment one and
formed similarly relative to their targets; for the 13.46% ( 27.05%) in experiment two. The difference in sum-
financial measures, both divisions were above tar- med percentage from target for RadWear minus PlusWear was
get on one (non-DIFF) measure, below target on 42.81% in experiment one and 40.51% in experiment two.
Thus, the relative performance of the two divisions was similar
one, and met the target on the other two. This across the experiments.
pattern was repeated for the customer-related 8
When PlusWear was evaluated first, mean evaluations
category and for the internal business processes were 68.09 versus 62.92 when RadWear was evaluated first.
group. In the learning and growth category, both Order did not interact with any other factors.
M.G. Lipe, S. Salterio / Accounting, Organizations and Society 27 (2002) 531–540 539
processing strategies based on the BSC categor- determine evaluations (e.g. Malina & Selto, 2000)
ization highlighting the potential relations among and in some cases an explicit two stage process is
measures within categories. used. In the first stage an evaluation is made on a
category-by-category basis and in the second stage
the manager combines these category level judg-
4. Limitations and conclusion ments into an overall performance evaluation (e.g.
Ittner, Larcker, & Meyer, 1998). Both approaches
Our experimental design has several limitations. are examples of managerial decision aids (i.e.
First, many of our experimental participants were explicit weights and formulas) to deal with infor-
novices in the use of the BSC and they did not mation processing limitations.
necessarily have business experience in the retail While we focus on the classification scheme
sector from which we pulled much of our case provided by Kaplan and Norton’s (1996a)
materials. Although the information processing BSC, other categorizations of performance
effects we tested relate to basic issues of cognition, measures may lead to similar results. That is, as
we do not know whether or how further experi- long as the categories have meaning to the deci-
ence may impact the observed effects.9 For exam- sion maker they may prime him or her to seek
ple, persons with more knowledge of the retail relations among measures and to react to any
apparel industry may have a better sense of the perceived correlations by reducing the impact of
sensitivity of each measure to managerial actions individual measures. Further research should
(perhaps affecting weightings on the measures) explore whether the BSC categorization has any
and to correlations among measures throughout unique effects, relative to other meaningful
the scorecard (again, affecting weightings). Thus, categorization schemes. Again, our study pro-
the BSC categories may have less effect on these vides a baseline against which future studies can
experts, who do not need the categorical organi- be compared.
zation to perceive and use these relations. Of The balanced scorecard has received significant
course, even experts will have a learning period, attention in the business press and a recent survey
during which they may act much like our experi- estimates 60% of Fortune 1000 firms have experi-
mental participants. Nonetheless, our experi- mented with the BSC (Silk, 1998). Despite this
mental results provide a baseline against which a widespread attention and use, the interaction of
study of BSC experts could be compared. the BSC and managers’ cognition has received lit-
A second limitation of our study is that we can- tle consideration. Many judgment issues deserve
not assess the accuracy of our participants’ eval- research attention. For example, how are trade-
uations since there is no accepted normative offs made across BSC categories, how are the
model for determining performance evaluation individual measures in a category weighted, and
scores (see also Lipe & Salterio, 2000). The direc- how does the differential reliability, precision, and
tion of the effect, however, seems consistent with other characteristics of the measures affect the
the normative response to nonindependence of weight placed on them? Other important
measures (e.g. Feltham & Xie, 1994). empirical questions include what is the covaria-
A third limitation of our study is that we only tion structure among the BSC measures within
investigate subjective performance evaluations and between categories, how are the categories
(Kaplan & Norton, 1996a, pp. 217–223). In some linked to underlying dimensions of managerial
companies explicit weightings or formulas are performance (i.e. how sensitive are they to man-
used for combining performance measures to agerial actions), and is it common for one distinct
aspect of managerial performance to affect mea-
sures in multiple BSC categories? Answers to these
9
Twenty of the subjects in experiment two indicated they questions will in turn lead to further research
had experience working for a BSC firm. Including experience as questions regarding the judgmental effects of the
a covariate in the analysis had no effect on the test results. BSC.
540 M.G. Lipe, S. Salterio / Accounting, Organizations and Society 27 (2002) 531–540