[Figure: WriteFileEx vs. Cohort Size — cost of call (cycles) and L2 misses as a function of cohort size.]

A stage is conceptually similar to a class in an object-based language, to the extent that it is a program structuring abstraction providing local state and operations. Stages, however, differ from objects in three major respects. First, operations in a stage are invoked …

[Figure 3. Example of stages and operations. Stage-A runs op-a, which invokes two operations in Stage-B and waits until they complete before running op-a's continuation.]

In practice, designing a program with stages focuses on partitioning the tasks into sub-tasks that are self-contained, have considerable code and data locality, and have logical unity. In many ways, this process is the control analogue of object-oriented design.

3.2 Operations

Operations are asynchronous computations exported by a stage. Invocation of an operation only requires its eventual execution, so the invoker and operation run independently. When an operation executes, it can invoke any number of child operations on any stage, including its own. A parent can wait for its children to finish, retrieve results from their computation, and continue processing. Figure 3 shows an operation (op-a) running in Stage-A that invokes two operations (op-x and op-y) in Stage-B, performs further computation, and then waits for its children. After they complete and return their results, op-a continues execution and processes the children's results.

The code within an operation executes sequentially and can invoke both conventional (synchronous) calls and asynchronous operations. However, once started, an operation is non-preemptible and runs until it relinquishes the processor. Programmers, unfortunately, must be careful not to invoke blocking operations that suspend the thread running operations on a processor. An operation that relinquishes the processor to wait for an event—such as asynchronous I/O, synchronization, or operation completion—instead provides a continuation to be invoked when the event occurs [14].

A continuation consists of a function and enough saved state to permit the computation to resume at the point at which it suspended. Explicit continuations are the simplest and least costly approach, as an operation saves only its live state in a structure called a closure. The other alternative, implicit continuations, requires the system to save the executing operation's stack, so that it can be resumed. This scheme, similar to fibers, simplifies programming, at some performance cost [2].

Asynchronous operations provide low-cost parallelism, which enables a programmer to express and exploit the concurrency within an application. The overhead, in time and space, of invoking an operation is close to that of a procedure call, as it only entails allocating and initializing a closure and passing it to a stage. When an operation runs to completion, it does not require its own stack or an area to preserve processor state, which eliminates much of the cost of threads. Similarly, returning a value and re-enabling a continuation are simple, inexpensive operations.

3.3 Programming Styles

Staged computation supports a variety of programming styles, including software pipelining, event-driven state machines, bi-directional pipelines, and fork-join parallelism. Conceptually, at least, stages in a server are arranged as a pipeline in which requests arrive at one end and responses flow from the other. This form of computation is easily supported by representing a request as an object passed between stages. Linear pipelining of this sort is simple and efficient, because a stage retains no information on completed computations.

However, stages are not constrained to this linear style. Another common programming idiom is bi-directional pipelining, which is the asynchronous analogue of call and return. In this approach, a stage passes subtasks to one or more other stages. The parent stage eventually suspends its work on the request, turns its attention to other requests, and resumes the original computation when the subtasks produce results. This style requires that an operation be broken into a series of subcomputations, which run when results appear. With explicit continuations, a programmer partitions the computation by hand, although a compiler could easily produce this code, which is close to the well-known continuation-passing style [6, 12]. With implicit continuations, a programmer only needs to indicate where the original computation suspends and waits for the subtasks to complete.

A generalization of this style is event-driven programming, which uses a finite state automaton (FSA) to control a reactive system [26, 29]. The FSA logic is encapsulated in a stage and is driven by external events, such as network messages and I/O completions, and by internal events from other asynchronous stages. An operation's closure contains the FSA state for a particular server request. The FSA changes state when a child operation completes or external events arrive. These transitions invoke computations associated with edges in the FSA. Each computation runs until it blocks and specifies the next state in the FSA.
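To make these ideas concrete, the following minimal sketch simulates, on a single thread, an operation like op-a that invokes two children and resumes through an explicit continuation. The Stage, invoke, and run_one names are assumptions made for this illustration, not the StagedServer interface (Section 4 describes the actual library); the point is only that the parent's live state resides in a heap-allocated closure rather than on a stack, and that its continuation is re-enabled by the last finishing child.

    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <utility>

    // Illustrative single-threaded sketch of stages, operations, and
    // explicit continuations; names here are assumed, not StagedServer's.
    struct Stage {
      std::deque<std::function<void()>> pending;   // queued operations (closures)
      void invoke(std::function<void()> op) { pending.push_back(std::move(op)); }
      bool run_one() {
        if (pending.empty()) return false;
        std::function<void()> op = std::move(pending.front());
        pending.pop_front();
        op();                                      // runs without preemption
        return true;
      }
    };

    Stage stageA, stageB;

    // op-a's live state lives in this closure, not on a thread's stack.
    struct OpA {
      int x_result = 0, y_result = 0;
      int outstanding = 2;                         // children not yet finished

      void start() {                               // entry: invoke two children
        stageB.invoke([this] { child(7, &x_result); });   // op-x in Stage-B
        stageB.invoke([this] { child(9, &y_result); });   // op-y in Stage-B
        // op-a now relinquishes the processor; finish() is its continuation.
      }
      void child(int v, int* out) {                // runs as a Stage-B operation
        *out = v * v;
        if (--outstanding == 0)                    // last child re-enables parent
          stageA.invoke([this] { finish(); });
      }
      void finish() {                              // continuation: consume results
        std::printf("op-a resumed with results %d and %d\n", x_result, y_result);
        delete this;
      }
    };

    int main() {
      (new OpA)->start();
      // Drive the two stages until no work remains; a real scheduler would
      // supply processors to stages instead (see Section 4.1.2).
      while (stageA.run_one() || stageB.run_one()) {}
    }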
[Figure 4 diagram: per-processor profile of the stages [0] Web, [1] FileCache, [2] Aggregate, [3] Disk, and [4] Network, annotated with queue lengths, operation counts, wait times, and operation counts on the arcs between stage-processor pairs.]
Figure 4. Profile of staged web server (Section 5.1). The performance metrics for each stage are broken down by processor (the system is running on four processors). The first column is the average queue length. The second column contains three metrics on operations at the stage: the quantity, the average wait time (milliseconds), and the maximum wait time. The third column contains corresponding metrics for operations that are suspended and restarted. The fourth column contains corresponding metrics for completed operations. The numbers on arcs are the number of operations started or restarted between two stage-processor pairs.
For example, the web server used in Section 5.1 is driven by a control-logic stage consisting of a FSA with fifteen states. The FSA describes the process by which a HTTP GET request arrives and is parsed, the referenced file is found in the cache or on disk, the file blocks are read and transmitted, and the file and connection are closed.

Describing the control logic of a server as a FSA opens the possibility of verifying many properties of the entire system, such as deadlock freedom, by applying techniques, such as model checking [15, 22], developed to model and verify systems of communicating FSAs.

3.4 Scheduling Policy Refinements

The third attribute of a stage is scheduling autonomy. When a stage is activated on a processor, the stage determines which operations execute and their order. This scheduling freedom allows several refinements of cohort scheduling to reduce the need for synchronization. In particular, we found three policies useful:

• An exclusive stage executes at most one of its operations at a time. Since operations run sequentially and completely, access to stage-local data does not need synchronization. This type of stage is similar to a monitor, except that its interface is asynchronous: clients delegate computation to the stage, rather than block to obtain access to a resource. When this strict serialization does not cause a performance bottleneck, this policy offers fast, error-free access to data and a simple programming model. This approach works well for fine-grained operations, as the cost of acquiring and releasing the stage's mutex can be amortized over a cohort of operations [25].

• A partitioned stage divides invocations (based on a key passed as a parameter) to avoid sharing data among operations running on different processors. For example, consider a file cache stage that partitions requests using a hash function on the file number (sketched in code at the end of this section). Each processor maintains its own hash table of in-memory disk blocks. Each hash table is accessed by only one processor, which enhances locality and eliminates synchronization. This policy, which is reminiscent of shared-nothing databases, permits parallel data structures without fine-grain synchronization.

• A shared stage runs its operations concurrently on many processors. Since several operations in a stage can execute concurrently, shared data accesses must be synchronized.

Other policies are possible and could be easily implemented within a stage.

It is important to keep in mind that these policies are implemented within the more general framework of cohort scheduling. When a stage is activated on a processor, it executes its outstanding operations, one after another. Nothing in the staged model requires cohort scheduling. Rather, the programming model and scheduling policy naturally fit together. A stage groups logically related operations that share data and provides the freedom to reorder computations. Cohort scheduling exploits this scheduling freedom by consecutively running similar operations.
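The following sketch suggests how the partitioned file cache policy described above might look. The class and method names are assumptions made for illustration, not part of StagedServer; what matters is that a hash of the partitioning key selects a per-processor table, so each table is touched by only one processor and needs no locks.

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Illustrative sketch of a partitioned stage's data; not StagedServer's API.
    class PartitionedFileCache {
      // One private hash table per processor; each is accessed by a single
      // processor, eliminating locking and cross-processor cache traffic.
      std::vector<std::unordered_map<int, std::string>> tables_;

    public:
      explicit PartitionedFileCache(int nprocs) : tables_(nprocs) {}

      // The partitioning key: a hash of the file number selects the processor
      // (and therefore the private table) that owns this request.
      int home_processor(int file_number) const {
        return file_number % static_cast<int>(tables_.size());
      }

      // In a staged system this would run as an operation on the owning
      // processor; here we simply index that processor's private table.
      const std::string* lookup(int file_number) const {
        const auto& t = tables_[home_processor(file_number)];
        auto it = t.find(file_number);
        return it == t.end() ? nullptr : &it->second;
      }
      void insert(int file_number, std::string block) {
        tables_[home_processor(file_number)][file_number] = std::move(block);
      }
    };

    int main() {
      PartitionedFileCache cache(4);            // four processors, as in Section 5
      cache.insert(42, "block data for file 42");
      if (const std::string* b = cache.lookup(42))
        std::printf("file 42 -> %s (owned by processor %d)\n",
                    b->c_str(), cache.home_processor(42));
    }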
3.5 Understanding Performance

[Figure: the staged web server's stages (Demultiplexer, Event Server, FileCache, Aggregator, Disk I/O) replicated across CPUs.]

A compelling advantage of the Staged model is that per-stage performance metrics are easily measured and displayed (Figure 4). …
4.1.1 Scheduling Policy

StagedServer implements a cohort scheduling policy, with enhancements to increase the processor affinity of data. The assignment of operations to processors occurs when an operation is submitted to a stage. By default, an operation invoked by code running on processor p executes on processor p in subsequent stages. This affinity policy enhances temporal locality and reduces cache traffic, as the operation's data tend to remain in the processor's cache. However, a program can override the default and execute an operation on a different processor when the processor to execute the operation is explicitly specified, a stage partitions its operations among processors, or a stage uses load balancing to redistribute operations.

A stage maintains a stack and a queue for each processor in the system. In general, operations originating on the local processor are pushed on the stack and operations from other processors are enqueued on the queue. When a stage starts processing a cohort, it first empties its stack in LIFO order, before turning to the queue. This scheme has two rationales. Processing the most recently invoked operations first increases the likelihood that an operation's data will reside in the cache. In addition, the stack does not require synchronization, since it is only accessed by one processor, which reduces the common-case cost of invoking an operation.

4.1.2 Processor Scheduling

StagedServer currently uses a simple wavefront algorithm to supply processors to stages. A programmer specifies an ordering of the stages in an application. In wavefront scheduling, processors independently alternate forward and backward traversals of this list of stages. At each stage, a processor executes the operations pending in its stack and queue. When the operations are finished, the processor proceeds to the next stage. If the processor repeatedly finds no work, it sleeps for exponentially increasing intervals. If a processor cannot gain access to an exclusive stage, because another processor is already working in the stage, the processor skips the stage.

The alternating traversal order in wavefront scheduling corresponds to a common communication pattern, in which a stage passes requests to its successors, which perform a computation and produce a result. It is easy to imagine other scheduling policies, but we have not evaluated them, as this approach works well for the applications we have studied. This topic is worth further investigation.

4.1.3 Thresholds

An orthogonal attribute of a stage is a pair of thresholds that force StagedServer to activate a stage if more than a given number of operations are waiting or after a fixed interval. When either situation arises, StagedServer stops the currently running stage (after it completes its operation), runs the threshold-exceeding stage, and then returns to the suspended stage. For simplicity, an interrupting stage cannot be interrupted, so other stages that exceed their thresholds are deferred until processing returns to the original stage. Thresholds are particularly useful for latency-sensitive stages, such as those interacting with the operating system, which must be regularly supplied with I/O requests to ensure that devices do not go idle.

Another useful refinement is a feedback mechanism, by which a stage informs other stages that it has sufficient tasks. These other stages can suspend processing, effectively turning the processor over to the first stage. So far, voluntary cooperation, rather than hard queue limits, has sufficed.

4.1.4 Partitioned Data

A partitioned stage typically divides its data, so that the operations running on a processor access only a non-shared portion. Avoiding sharing eliminates the need to synchronize access to the data and reduces the cache traffic that results when data are accessed from more than one processor. The current system partitions a variable—using the well-known technique of privatization [30]—by storing its values in a vector with an entry for each processor. Code uses the processor id to index this vector and obtain a private value.

4.2 Closure Class

Closure is a templated base class for defining closures, which are a combination of code and data. StagedServer uses closures to implement operations and their continuations. When an operation is first invoked on a stage, the invoker creates a closure and initializes it with parameter values. Later, the stage executes the operation by invoking one of the closure's methods, as specified by the operation invocation. This method is an ordinary C++ method. When it returns, the method must state whether the operation is complete (and optionally returns a value), whether it is waiting for a child to finish, or whether it is waiting for another operation to resume its execution.
The closure passes arguments between a parent and its continuation, and results between a child and its parent. This process may repeat multiple times, with each continuation taking on the role of a parent. In other words, these closures are actually multiple-entry closures, with an entry for the original operation invocation and entries for subsequent continuations. In practice, a stage treats these methods identically and does not distinguish between an operation and its continuation.

5 Experimental Evaluation

To evaluate the benefits of cohort scheduling and the StagedServer library, we built two prototypical server applications. The first—a web server—is I/O-bound, as its task consists of responding to HTTP GET requests by retrieving files from a disk or file cache and sending them over a network. The second—a publish-subscribe server—is compute-bound, as the amount of data transferred is relatively small, but the computation to match an event against a database of subscriptions is expensive and memory-intensive.

5.1 I/O-Intensive Server

To compare threads against stages, we implemented two web servers. The first is structured using a thread pool (THWS) and the second uses StagedServer (SSWS). We took care to make the two servers efficient and comparable and to share common code. In particular, both servers use Microsoft Windows' asynchronous I/O operations. The threaded server was organized in a conventional manner as a thread accepting connections and passing them to a pool of 256 worker threads, each of which performs the server's full functionality: parsing a request, reading a file, and transmitting its contents. This server used the kernel's file cache. The SSWS server can also process up to 256 simultaneous requests. It was organized as a control-logic stage, a network I/O stage, and the disk I/O and caching stages described in Section 3.6. The parameters were chosen by experimentation and yielded robust performance for the benchmark and hardware configuration.

As a baseline for comparison, we also ran the experiments on Microsoft's IIS web server, which is a highly tuned commercial product. IIS performed better than the other servers, but the difference was small, which partially validates their implementations.

Our test configuration consisted of a server and three clients. The server was a Compaq Proliant DL580R containing four 700MHz Pentium III Xeon processors (2MB L2 cache) and 4GB of RAM. It had eight 10000RPM SCSI3 disks, connected to a Compaq Smart Array controller. The clients ran on Dell PowerEdge 6350s, each containing four 400MHz Pentium II Xeon processors with 1GB of RAM. The clients and server were connected by a dedicated Gigabit Ethernet network, and both ran Windows 2000 Server (SP1).

We used the SURGE benchmark, which retrieves web pages whose size, distributions, and reference pattern are modeled on actual systems [8]. SURGE measures the ability of a web server to process HTTP GET requests, retrieve pages from a disk, and send them back to a client. This benchmark does not attempt to capture the full behavior of a web server, which must handle other types of HTTP requests, execute dynamic content, perform server management, and log data. To increase the load, we ran a large configuration, with a web site of 1,000,000 pages (20.1GB) and a reference stream containing 6,638,449 requests. A SURGE workload is characterized by User-Equivalents (UEs), each of which models one user accessing the web site. We found that we could run up to 2000 UEs per client. All tests were run with the UE workload balanced across the client machines. The reported numbers are for 15 minutes of client execution, starting with a freshly initialized server.

Figure 6 shows the bandwidth and latency of the threaded (THWS) and StagedServer (SSWS) servers, and compares them against a commercial web server (IIS). The first chart contains the number of pages retrieved by the clients per second (since requests follow a fixed sequence, the number of pages is a measure of bandwidth), and the second chart contains the average latency, perceived by a client, to retrieve a page.

Several trends are notable. Under light load, SSWS's performance is approximately 6% lower than THWS's, but as the load increases, SSWS responds to as many as 13% more requests per unit time. The second chart, in part, explains this difference. SSWS's latency is higher than THWS's latency under light load (by a factor of almost 20), but as the load increases, SSWS's latency grows only 2.3 times, while THWS's latency increases 45 times, to a level equal to SSWS's.

The commercial server, Microsoft's IIS, outperformed SSWS by 4–9% and THWS by 0–22%. Its latency under heavy load was up to 45% better than the other servers' latency.
[Figure 6 charts: HTTP GET bandwidth (pages/sec) and HTTP GET latency (average milliseconds) for THWS, SSWS, and IIS, plotted against 1000 to 6000 UEs.]

Figure 6. Performance of web servers. These charts show the performance of the threaded server (THWS), StagedServer server (SSWS), and Microsoft's IIS server (IIS). The first records the number of web pages received by the clients per second. The second records the average latency, as perceived by the client, to retrieve a page. The error bars are the standard deviation of the latency.

[Figure 7 charts: cycles per instruction and L2 cache miss rates (kernel and user) for THWS and SSWS, plotted against 1000 to 6000 UEs.]

Figure 7. Processor performance of servers. These charts show the processor performance of the threaded (THWS) and StagedServer (SSWS) web servers. The first chart shows the cycles per instruction (CPI) and the second shows the rate of L2 cache misses.
SSWS's performance, which is more stable and predictable under heavy load than the threaded server's, is appropriate for servers, in which performance challenges arise as offered load increases. The SSWS server's overall performance was relatively better, and its processor performance degraded less under load than the THWS server's. The improved processor performance was reflected in a measurably improved throughput under load.

5.2 Compute-Bound Server

To evaluate the performance of StagedServer on a compute-bound application, we also built a simple publish-subscribe server. The server used an efficient, cache-friendly algorithm to match events against an in-core database of subscriptions [16]. A subscription is a conjunction of terms comparing variables against integers. An event is a set of assignments of values to variables. An event matches a subscription if all of its terms are satisfied by the value assignments in the event (a sketch of this semantics appears at the end of this section).

Both the threaded (THPS) and StagedServer (SSPS) versions of this application shared a common publish-subscribe implementation; the only difference between them was the use of threads or stages to structure the computation. The benchmark was the Fabret workload: 1,000,000 subscriptions and 100,000 events. The platform was the same as above.

The response time of the StagedServer version to events was better under load (Figure 8). With four or more clients publishing events, THPS responded in an average of 0.57 ms to each request. With four clients, SSPS responded in an average time of 0.53 ms, and its response improved to 0.47 ms with 16 or more clients (a 21% improvement over the threaded version).

In large measure, this improvement is due to improved processor usage (Figure 8). With 16 clients, SSPS averaged 2.0 cycles per instruction (CPI) over the entire benchmark, while THPS averaged 2.7 CPI (a 26% reduction). Over the compute-intensive event-matching portion, SSPS averaged 1.7 CPI, while THPS averaged 2.5 CPI (a 33% reduction). In large measure, this improvement is attributable to a greater than 50% reduction in L2 cache misses, from 58% of user-space L2 cache requests (THPS) to 26% of references (SSPS).
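To make the matching semantics concrete, here is a naive sketch using equality terms; it illustrates only the definition above and does not attempt to reproduce the cache-friendly matching algorithm of [16].

    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    // Naive illustration of the matching semantics only.
    struct Term { int variable; int value; };      // a term: variable == value
    using Subscription = std::vector<Term>;        // conjunction of terms
    using Event = std::unordered_map<int, int>;    // variable -> assigned value

    // An event matches a subscription if every term is satisfied by the
    // event's value assignments.
    bool matches(const Subscription& sub, const Event& ev) {
      for (const Term& t : sub) {
        auto it = ev.find(t.variable);
        if (it == ev.end() || it->second != t.value) return false;
      }
      return true;
    }

    int main() {
      Subscription sub = {{/*variable*/ 0, /*value*/ 3}, {1, 7}};
      Event ev = {{0, 3}, {1, 7}, {2, 9}};
      std::printf("match: %s\n", matches(sub, ev) ? "yes" : "no");  // yes
    }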
[Figure 8 charts: average response time (ms) vs. number of clients, and processor performance (average CPI) at 16 clients, for the threaded and StagedServer publish-subscribe servers.]

6 Related Work

… a subset of the data structure, and a queue in front of each process to accumulate a cohort.

The advantages and disadvantages of threads and processes are widely known [5]. More recently, several papers have investigated alternative server architectures. Chankhunthod et al. described the Harvest web cache, which uses an event-driven, reactive architecture to invoke computation at transitions in its state-machine control logic [13]. The system, like StagedServer, uses non-blocking I/O; careful avoidance of page faults; and a non-blocking, non-preemptive scheduling policy [7, 26]. Pai proposed a four-fold characterization of server architectures: multi-process, multi-threaded, single-process event driven, and asymmetric multi-process event driven [26]. These alternatives are orthogonal to …
A stage is similar in some respects to an object in an object-based language, in that it provides both local state and operations to manipulate it. The two differ because objects are, for the most part, passive and their methods are synchronous—though asynchronous object models exist. Many object-oriented languages, such as Java [17], integrate threads and synchronization, but the active entity remains a thread that synchronously runs a method on a passive object. By contrast, in staged computation, a stage is asked to perform an operation, but is given the autonomy to decide how and when to actually execute the work. This decoupling of request and response is valuable because it enables a stage to control its concurrency and to adopt an efficient scheduling policy, such as cohort scheduling.

Stages are similar in some respects to Agha's Actors [3]. Both start with a model of asynchronous communication between autonomous entities. Actors have no internal concurrency and do not give entities control over their scheduling, but instead presume a reactive model in which an Actor responds to a message by invoking a computation. Stages, because of their internal concurrency and scheduling autonomy, are better suited to cohort scheduling. Actors are, in turn, an instance of dataflow, a more general computing model [23, 24]. Stages also can be viewed as an instance of dataflow computation.

Cilk is a language based on a provably efficient scheduling policy [11]. The language is thread, not object, based, but it shares some characteristics with stages. In both, once started, a computation is not preempted. While running, a computation can spawn off other tasks, which return their results by invoking a continuation. However, Cilk's work-stealing scheduling policy does not implement cohort scheduling, nor is it under program control. Recent work, however, has improved the data locality of work-stealing scheduling algorithms [1].

JAWS is an object-oriented framework for writing web servers [18]. It consists of a collection of design patterns, which can be used to construct servers adapted to a particular operating system by selecting an appropriate concurrency mechanism (processes or threads), creating a thread pool, reducing synchronization, caching files, using scatter-gather I/O, or employing various HTTP and TCP-specific optimizations. StagedServer is a simpler library that provides a programming model that directly enhances program locality and performance.

An earlier version of this work was published as a short, extended abstract [21].

7 Conclusion

Servers are commonly structured as a collection of parallel tasks, each of which executes all the code necessary to process a request. Threads, processes, or event handlers underlie the software architecture of most servers. Unfortunately, this software architecture can interact poorly with modern processors, whose performance depends on mechanisms—caches, TLBs, and branch predictors—that exploit program locality to bridge the increasing processor-memory performance gap. Servers have little inherent locality. A thread typically runs for a short and unpredictable amount of time and is followed by an unrelated thread, with its own working set. Moreover, servers interact frequently with the operating system, which has a large and disruptive working set. The poor processor performance of servers is a natural consequence of their threaded architecture.

As a remedy, we propose cohort scheduling, which increases server locality by consecutively executing related operations from different server requests. Running similar code on a processor increases instruction and data locality, which aids hardware mechanisms, such as caches and branch predictors. Moreover, this architecture naturally issues operating system requests in batches, which reduces the system's disruption.

This paper also describes the staged computation programming model, which supports cohort scheduling by providing an abstraction for grouping related operations and mechanisms through which a program can implement cohort scheduling. This approach has been implemented in the StagedServer library. In a series of tests using a web server and a publish-subscribe server, the StagedServer code performed better than threaded code, with a lower level of cache misses and instruction stalls and better performance under heavy load.

Acknowledgements

This work has been going on for a long time, and countless people have provided invaluable insights and feedback. This list is incomplete, and we apologize for omissions. Rick Vicik made important contributions to the idea of cohort scheduling and the early implementations. Jim Gray has been a ceaseless supporter and advocate of this work. Kevin Zatloukal helped write the web server and run many early experiments. Trishul Chilimbi, Jim Gray, Vinod Grover, Mark Hill, Murali Krishnan, Paul Larson, Milo Martin, Ron Murray, Luke McDowell, Scott McFarling, Simon Peyton-Jones, Mike Smith, and Ben Zorn provided many helpful questions and comments. The referees and shepherd, Carla Ellis, provided many helpful comments.
References

[1] Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe, "The Data Locality of Work Stealing," in Proceedings of the Twelfth ACM Symposium on Parallel Algorithms and Architectures (SPAA). Bar Harbor, ME, July 2000.

[2] Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, and John R. Douceur, "Cooperative Tasking without Manual Stack Management," in Proceedings of the 2002 USENIX Annual Technical Conference. Monterey, CA, June 2002.

[3] Gul A. Agha, ACTORS: A Model of Concurrent Computation in Distributed Systems. Cambridge, MA: MIT Press, 1988.

[4] Anastassia G. Ailamaki, David J. DeWitt, Mark D. Hill, and David A. Wood, "DBMSs on a Modern Processor: Where Does Time Go?," in Proceedings of the 25th International Conference on Very Large Data Bases. Edinburgh, Scotland: Morgan Kaufmann, September 1999, pp. 266-277.

[5] Thomas E. Anderson, "The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 1, num. 1, pp. 6-16, 1990.

[6] Andrew W. Appel, Compiling with Continuations. Cambridge University Press, 1992.

[7] Gaurav Banga, Peter Druschel, and Jeffrey C. Mogul, "Better Operating System Features for Faster Network Servers," in Proceedings of the Workshop on Internet Server Performance. Madison, WI, June 1998.

[8] Paul Barford and Mark Crovella, "Generating Representative Web Workloads for Network and Server Performance Evaluation," in Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems. Madison, WI, June 1998, pp. 151-160.

[9] Luiz André Barroso, Kourosh Gharachorloo, and Edouard Bugnion, "Memory System Characterization of Commercial Workloads," in Proceedings of the 25th Annual International Symposium on Computer Architecture. Barcelona, Spain, June 1998, pp. 3-14.

[10] Trevor Blackwell, "Speeding up Protocols for Small Messages," in Proceedings of the ACM SIGCOMM '96 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. Palo Alto, CA, August 1996, pp. 85-95.

[11] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou, "Cilk: An Efficient Multithreaded Runtime System," Journal of Parallel and Distributed Computing, vol. 37, num. 1, pp. 55-69, 1996.

[12] Satish Chandra, Bradley Richards, and James R. Larus, "Teapot: A Domain-Specific Language for Writing Cache Coherence Protocols," IEEE Transactions on Software Engineering, vol. 25, num. 3, pp. 317-333, 1999.

[13] Anawat Chankhunthod, Peter Danzig, Chuck Neerdaels, Michael F. Schwartz, and Kurt J. Worrell, "A Hierarchical Internet Object Cache," in Proceedings of the USENIX 1996 Annual Technical Conference. San Diego, CA, January 1996.

[14] Richard P. Draves, Brian N. Bershad, Richard F. Rashid, and Randall W. Dean, "Using Continuations to Implement Thread Management and Communication in Operating Systems," in Proceedings of the Thirteenth ACM Symposium on Operating System Principles. Pacific Grove, CA, October 1991, pp. 122-136.

[15] Edmund M. Clarke Jr., Orna Grumberg, and Doron A. Peled, Model Checking. Cambridge, MA: MIT Press, 1999.

[16] Françoise Fabret, H. Arno Jacobsen, François Llirbat, Joao Pereira, Kenneth A. Ross, and Dennis Shasha, "Filtering Algorithms and Implementation for Very Fast Publish/Subscribe Systems," in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data and Symposium on Principles of Database Systems. Santa Barbara, CA, May 2001, pp. 115-126.

[17] James Gosling, Bill Joy, and Guy Steele, The Java Language Specification. Addison-Wesley, 1996.

[18] James Hu and Douglas C. Schmidt, "JAWS: A Framework for High-performance Web Servers," in Domain-Specific Application Frameworks: Frameworks Experience By Industry, M. Fayad and R. Johnson, Eds. John Wiley & Sons, October 1999.

[19] F. Irigoin and R. Triolet, "Supernode Partitioning," in Proceedings of the Fifteenth Annual ACM Symposium on Principles of Programming Languages. San Diego, CA, January 1988, pp. 319-329.

[20] Kimberly Keeton, David A. Patterson, Yong Qiang He, Roger C. Raphael, and Walter E. Baker, "Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads," in Proceedings of the 25th Annual International Symposium on Computer Architecture. Barcelona, Spain, June 1998, pp. 15-26.

[21] James R. Larus and Michael Parkes, "Using Cohort Scheduling to Enhance Server Performance (Extended Abstract)," in Proceedings of the Workshop on Optimization of Middleware and Distributed Systems. Snowbird, UT, June 2001, pp. 182-187.

[22] James R. Larus, Sriram K. Rajamani, and Jakob Rehof, "Behavioral Types for Structured Asynchronous Programming," Microsoft Research, Redmond, WA, May 2001.

[23] Edward A. Lee and Thomas M. Parks, "Dataflow Process Networks," Proceedings of the IEEE, vol. 83, num. 5, pp. 773-799, 1995.

[24] Walid A. Najjar, Edward A. Lee, and Guang R. Gao, "Advances in the Dataflow Computation Model," Parallel Computing, vol. 25, pp. 1907-1929, 1999.

[25] Yoshihiro Oyama, Kenjiro Taura, and Akinori Yonezawa, "Executing Parallel Programs with Synchronization Bottlenecks Efficiently," in Proceedings of the International Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications (PDSIA '99). Sendai, Japan: World Scientific, July 1999, pp. 182-204.

[26] Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel, "Flash: An Efficient and Portable Web Server," in Proceedings of the 1999 USENIX Annual Technical Conference. Monterey, CA, June 1999, pp. 199-212.

[27] David A. Patterson and John L. Hennessy, Computer Architecture: A Quantitative Approach, 2nd ed. Morgan Kaufmann, 1996.

[28] Sharon Perl and Richard L. Sites, "Studies of Windows NT Performance using Dynamic Execution Traces," in Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation (OSDI). Seattle, WA, October 1997, pp. 169-183.

[29] Matt Welsh, David Culler, and Eric Brewer, "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services," in Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01). Alberta, Canada, October 2001, pp. 230-243.

[30] Michael J. Wolfe, High Performance Compilers for Parallel Computing. Addison-Wesley, 1995.