Multiprocessors
Authorized licensed use limited to: Hewlett-Packard via the HP Labs Research Library. Downloaded on May 25, 2009 at 05:31 from IEEE Xplore. Restrictions apply.
evicting each other's data from any shared cache. Communication between multiple threads on CMPs takes place via the shared memory hierarchy, such as a shared cache or main memory. Due to demand-based data transfers and cache coherence mechanisms, the requesting thread stalls until the data is delivered. We believe a mechanism that exploits the cores' proximity and allows fast communication between cores is needed. Previous research has shown the benefits of data forwarding for reducing communication latency in distributed shared memory multiprocessors. We propose employing a similar data movement concept within the chip on CMPs with private caches, which will extend the communication support offered by today's CMPs and reduce communication latency and miss rates.

The hardware optimizations proposed here are applicable to many types of communicating threads, but were motivated by SPPM's behavior on current CMPs, particularly those with private caches. It is shown in [18] that communicating through the cache coherence mechanism is slower than communicating through memory for some common CMPs. Using Register-Based Synchronization (RBS) will reduce the spin waiting of the threads and cut cache contention and overhead. In addition, our approach of moving data as it becomes available (called prepushing) reduces the latency experienced by the consumer thread, as cache misses are turned into hits.

Similar ideas have been studied in the literature for large-scale machines, but the application of RBS and prepushing to CMPs is new. Our work also focuses on SPPM applications, which provide better memory performance on CMPs than regular parallel applications. Our implementation does not rely on directory cache coherence protocols, as has been the case in previous studies, but uses a broadcast cache coherence protocol similar to that of AMD's Opteron [8]. Furthermore, our study targets CMPs with private L1 and L2 caches, which is very different from previous studies. Finally, prepushing forwards data based on hints received from the application, and hence is software controlled to reduce complexity and provide flexibility.

Our contributions include: (1) We present significant architecture optimizations to improve the synchronization and communication support for multithreaded producer-consumer applications running on CMPs. (2) We evaluate the performance of a well-established synchronization mechanism, hardware registers, on today's CMPs. (3) We introduce a data forwarding approach, prepushing, tailored to CMPs with private caches. (4) We evaluate the performance of four different prepushing models, including shared and exclusive prepushing and prepushing to different cache levels. (5) RBS and prepushing show promising results as architecture optimizations to support synchronization and communication on CMPs.

The rest of the paper is organized as follows: Section 2 outlines prior related research. Section 3 describes our architecture optimizations to improve synchronization and communication support. Section 4 presents our benchmark applications, while Section 5 shows the results of our simulations and compares performance with and without our techniques. Finally, Section 6 summarizes our contributions and outlines future work to further improve synchronization and communication support on CMPs.

2. Related Research

Several machines used hardware registers to provide synchronization and communication among multiple processors. The Cray X-MP [2] had several clusters of shared registers for interprocess communication and synchronization. The operating system could assign a cluster to several processors, enabling the use of the cluster for communication and synchronization. The Cray T3E's [15] memory interface was improved with a set of explicitly-managed external registers. All remote communication and synchronization was done between these registers and memory. The M-Machine [4] also provided synchronization through registers. For register synchronization, full/empty bits were used to determine when register values were valid. This approach allowed the waiting thread to stall rather than spin. The experimental results show that register-based communication and synchronization perform better than memory-based communication and synchronization.

A hardware-based synchronization approach for simultaneous multithreaded (SMT) processors is presented in [16]. The presented approach, blocking acquire and release, is a hardware implementation of traditional software synchronization approaches. Synchronization through full/empty registers is also presented. The success of prior work on hardware-based synchronization leads us to employ full/empty registers to extend the synchronization support offered by today's CMPs.

Data forwarding has been studied on several platforms. The Stanford Dash Multiprocessor [11] provides operations that allow the producer to send data directly to consumers. The update-write operation sends data to all processors that have the data in their caches, while the deliver operation sends the data to specified processors. The KSR1 [5] provides programmers with a poststore operation. When a variable is updated, using poststore causes a copy of the variable to be sent to all caches that contain a copy of that variable. An experimental study of poststore is presented in [14]. A compiler algorithm framework for data forwarding is presented in [10]; a Write-and-Forward assembly instruction is inserted by the compiler, replacing ordinary write instructions. The same data forwarding approach and prefetching are compared in [9]. Similarly, another study shows that prefetching is insufficient for producer-consumer sharing patterns and migratory sharing patterns [1].
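The common idea in these producer-initiated schemes, sending data before it is requested, can be illustrated with a toy model. This sketch is our own simplification, not any cited machine's protocol: we charge a demand miss one request message plus one data message, and a forwarded line only the data message.

```python
# Toy model of pull (demand fetch) vs. push (data forwarding).
# Assumptions (ours): one request + one data message per demand miss,
# one data message per forwarded line; no evictions or races modeled.

def pull(lines, consumer_cache):
    bus_ops = 0
    for line in lines:
        if line not in consumer_cache:   # miss: request + data transfer
            bus_ops += 2
            consumer_cache.add(line)
    return bus_ops

def push(lines, consumer_cache):
    bus_ops = 0
    for line in lines:                   # producer forwards each finished line
        bus_ops += 1                     # data transfer only, no request
        consumer_cache.add(line)
    return bus_ops

block = range(16)                        # a data block of 16 cache lines
demand_ops = pull(block, set())          # 2 bus operations per line
forward_ops = push(block, set())         # 1 bus operation per line
```

With forwarding, every consumer access that would have missed now hits, which is the latency benefit these studies report.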
A range of consumer- and producer-initiated mechanisms and their performance on a set of benchmarks is studied in [3]. Store-ORDered Streams (SORDS) [19] is a memory streaming approach to send data from producers to consumers in distributed shared memory multiprocessors. This approach is based on producer-consumer temporal address correlation, which claims that shared values are consumed in approximately the same order that they were produced.

Although all these studies on data forwarding fall into the same category as ours, the implementation techniques and target areas are different. First, our implementation does not rely on directory cache coherence protocols, but uses a broadcast cache coherence protocol similar to the AMD Opteron's. Second, our study is not tailored to distributed shared memory multiprocessors, but targets CMPs with private L1 and L2 caches. Data forwarding on CMPs is new, because CMPs are new. Third, our approach forwards data based on hints received from the application, and hence is software controlled to reduce complexity and provide flexibility.

3. CMP Architecture Optimizations

3.1. Register-Based Synchronization (RBS)

In earlier systems, hardware registers were used to provide synchronization and communication among processors. The success of this prior work led us to employ full/empty registers to extend the synchronization support offered by today's CMPs. Even though the concept of using hardware registers for synchronization and communication is not new, it has not been applied to CMPs before, yet is sorely needed. As discussed in Section 2, research shows that RBS performs better than memory-based communication and synchronization.

In SPPM applications, the consumer is usually faster than the producer because it finds its data in the cache rather than in memory. Thus, when the consumer is done, it has to spin wait until the producer updates the next data block and grants access. Figure 1 shows our RBS scheme for CMPs. The shared register has a full/empty status bit. When the producer reaches the synchronization point, it sets the register status to full, which wakes the consumer up. When the consumer is done, instead of spin waiting and consuming system resources, it goes into idle mode to save power. Upon register status change, the consumer wakes up and the process repeats. We consider the synchronization registers to be shared between cores.

Figure 1. RBS for SPPM Applications

To evaluate RBS, we modified the GEMS [13] Ruby memory model, which is based on the Simics [12] full system simulation platform. Instead of adding hardware registers to the system, we used memory-mapped locations to track the accesses to the synchronization variables and evaluated the gain of using hardware registers for SPPM applications on CMPs. To dedicate a set of memory locations for the evaluation of hardware registers, we wrote a device driver and mapped it to memory. Then, we accessed the device as if we were accessing the hardware registers.

The following pseudo-code shows a simple consumer implementation in SPPM applications. The data is partitioned into blocks; a data block is read in each loop iteration. Before reading a data block, the consumer has to check whether the synchronization window has been violated. The synchronization window is the interval between the producer's current position and the consumer's current position in the shared data array.

    for i = 0 to N do
        while synchronization window for dataBlock[i] is violated do
            wait
        end while
        read dataBlock[i]
        generate results
    end for

The condition that checks for a synchronization window violation includes accesses to the synchronization variables. Since the data block size consumed per loop may exceed the L1 cache size, the synchronization variables will be evicted from the L1 cache under an LRU eviction mechanism. Therefore, we assume an optimal miss rate of 15% and estimate the average access time to the synchronization variables to be 3 cycles on a CMP system with an L1 cache of 1-cycle latency and an L2 cache of 12-cycle latency, as shown below:

    average access time = L1 latency + miss rate * (L1 latency + L2 latency)
                        = 1 + 0.15 * (1 + 12)
                        ≈ 3

The cost of implementing RBS is the addition of shared hardware registers with full/empty bits to the system. The latency of accessing those registers would be the same as accessing any register on the system.
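The full/empty discipline behind RBS can be sketched in software. The following is a minimal model (the class and method names are ours, not a hardware interface) in which the consumer blocks on an empty register instead of spin waiting, mirroring the idle mode described above:

```python
import threading

class FullEmptyRegister:
    """Toy model of a shared register with a full/empty status bit."""
    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_full(self, value):
        # Producer: store the value and set the status bit to full.
        with self._cond:
            self._value = value
            self._full = True
            self._cond.notify_all()      # wakes the blocked consumer

    def read_empty(self):
        # Consumer: stall (block, not spin) until full, then clear the bit.
        with self._cond:
            while not self._full:
                self._cond.wait()        # no spinning, no cache contention
            self._full = False
            return self._value

reg = FullEmptyRegister()
producer = threading.Thread(target=lambda: reg.write_full(42))
producer.start()
result = reg.read_empty()                # stalls until the producer sets full
producer.join()
```

A real implementation would be a register file shared between cores; the blocking wait here stands in for the consumer's idle mode.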
In SPPM applications, the number of hardware registers needed to replace the synchronization variables is three: two for the producer and consumer positions, and one for the flag that resets the positions when both threads are done at the end of each iteration.

3.2. Data Communications via Prepushing

Data forwarding has been shown to reduce communication latency and miss rates, as discussed in Section 2. The previous studies mostly involve distributed shared memory multiprocessors and directory cache coherence protocols. In this paper, we present prepushing to facilitate data transfers between a single producer-consumer pair. Our work brings the data forwarding idea (i.e., sending data before it is requested) to CMPs with private L1 and L2 caches, and studies the effects of several prepushing models on the performance of producer-consumer applications.

Figure 2. Execution Behaviors of Conventional Approach and Prepushing

Figure 2 illustrates the execution behavior of a single producer-consumer pair, comparing the conventional approach to prepushing. Normally, data is pulled by the consumer rather than pushed by the producer. Each data block consists of several cache lines. As the consumer accesses a data block, it issues cache line requests and must wait or stall while each line is retrieved from the remote cache. If each cache line is consumed in less time than it takes to fetch the next one, prefetching will not be fully effective at masking the remote fetch latency. Furthermore, prefetched cache lines may still be in use by the producer, so a prefetcher may add to the coherence traffic overhead rather than reduce it. Prepushing, on the other hand, allows the producer to send the data as soon as it is done with it, so the consumer receives the data by the time it is needed. In the conventional approach, two operations per cache line are seen on the bus, while only one operation per cache line is seen with prepushing. The prepushing approach reduces the number of data requests, the number of misses, and the communication latency seen by the consumer.

We do not consider process/thread migration due to a context switch, because SPPM applications assign each thread to a specific processor. This is done to prevent necessary data from being evicted from the cache. Our simulations take place on a Solaris 9 platform, so we use the processor_bind function to assign each thread to a specific processor. To evaluate the prepushing approach, we extended the GEMS Ruby memory model. We added a prepusher as a hardware component integrated with the cache controller, and experimented with a MOESI broadcast cache coherence protocol. A prepush queue contains information about data that will be forwarded from the producer to the consumer. In addition, new events, actions, and state transitions are added to the cache controller to describe the prepushing behavior. In general, the cost of adding a prepusher to the system is small, as it would be implemented by extending the cache coherence protocol.

The prepushing approach is well suited to SPPM applications with tightly synchronized producer-consumer pairs. Software hints are inserted into the SPPM applications, and hence the prepushing approach is software controlled. When the producer is done with a block of data, it signals the prepusher to initiate data movement, providing information such as the data block address, the data block size, and the destination. The prepusher determines which cache lines comprise the data block and calculates the physical address of each cache line. Then, messages containing information about the cache lines are placed into the prepush queue. The cache controller reads the prepush queue and handles the data transfers in shared or exclusive mode, based on the prepush request type. The data is prepushed to the L1 cache or the L2 cache of the consumer, depending on the prepushing model. On the consumer side, if a tag exists for a particular prepushed cache line, it is written directly to the existing location. Otherwise, it is written to a newly allocated location, which may require a cache line replacement. We explored four different prepushing models:

PUSH-S-L1: The prepush policy is to send each cache line to the consumer in shared state as soon as it is produced, while the producer's cache line assumes owned state. This sharing behavior allows the producer to reuse the data, if necessary. The placement policy puts the prepushed cache line into the L1 cache, which may include evicting cache lines from the L1 to the L2 cache.
PUSH-X-L1: The prepush policy is to send each cache line to the consumer in exclusive state as soon as the producer no longer needs it, thus invalidating it in the producer's cache. This behavior is similar to migratory sharing [7]. It improves performance when the consumer needs to write to the cache line, as it eliminates the invalidate request otherwise needed to make the cache line exclusive. The placement policy is the same as in PUSH-S-L1.

PUSH-S-L2: The prepush policy is the same as in PUSH-S-L1. The placement policy is to write the prepushed cache line to the L1 cache if it is not full; otherwise, the cache line is written to the L2 cache, evicting cache lines if needed.

PUSH-X-L2: The prepush policy is the same as in PUSH-X-L1, while the placement policy is the same as in PUSH-S-L2.

The implementation must also handle the race condition in which the consumer issues an explicit cache line request before the prepushing of that particular cache line takes place. After the prepusher calculates the cache line address, it checks the global request table to see if the corresponding cache line has been explicitly requested. If so, prepushing of that cache line is canceled. If this condition is not handled, the cache line will be sent to the remote cache, incurring extra network traffic.

4. Benchmarks

We used three hand-coded producer-consumer applications as benchmarks for our research. Temporal parallelism is exploited by running the producer and consumer concurrently. The producer fetches the data it needs from memory, generates its results, and updates the data in the cache. While the data is still in the cache, it is fetched and used by the consumer. The producer and consumer need to be synchronized to prevent the producer from getting far ahead of the consumer. Therefore, there is a synchronization window between the producer and consumer to keep the producer's data available in the cache for the consumer's access.

Red Black (RB) Solver: The RB solver solves a partial differential equation using a finite differencing method. It uses a two-dimensional grid, where each point is updated by combining its present value with those of its four neighbors. To avoid dependences, alternate points are labeled red or black, and only one type is updated at a time. The values are updated until convergence is reached. The application is restructured into a producer-consumer pair where the producer does the red computation while the consumer does the black.

Figure 3. Red Black Solver Synchronization Window

Figure 3 illustrates the RB solver grid structure and the producer's and consumer's positions in the shared and exclusive prepushing models. For each point in the current row, the previous and next rows are needed. In the exclusive prepushing models, the synchronization window between the producer and consumer is larger. The producer cannot prepush the row it has just finished processing (i.e., the current row), because exclusive prepushing would invalidate data needed for the next row, so the producer has to prepush the previous row. This larger synchronization window does not degrade performance; in fact, it reduces the consumer's remote cache requests, as discussed in Section 5.

Finite Difference Time Domain (FDTD): FDTD is an extremely memory-intensive electromagnetic simulation. It uses six three-dimensional arrays, three of which constitute the electric field while the other three constitute the magnetic field. During each time step, the magnetic field is updated using values of the electric field from the previous time step. Then, the electric field is updated using values of the magnetic field computed in the current time step. The application is restructured into a producer-consumer pair where the producer does the magnetic field update while the consumer does the electric field update.

ARC4 Stream Cipher (ARC4): ARC4 is the Alleged RC4, a stream cipher commonly used in protocols such as the Secure Sockets Layer (SSL) and Wired Equivalent Privacy (WEP) in wireless networks. The encryption process uses a pseudo-random number generator to generate a keystream of pseudo-random bits. As each byte of the keystream is generated, it is XORed with the corresponding byte of the plaintext.
CPU: 2 GHz UltraSPARC III+ processors
L1 Cache: 64 KB, 2-way associative, 1-cycle latency
L2 Cache: 1 MB, 16-way associative, 12-cycle latency
Cache Line Size: 64 B
Memory: 4 GB, 120-cycle latency
Operating System: Solaris 9
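Plugging the L1 and L2 latencies above into the Section 3.1 estimate (with its assumed 15% miss rate for the synchronization variables) reproduces the roughly 3-cycle average access time:

```python
L1_LATENCY = 1     # cycles, from the simulated system parameters above
L2_LATENCY = 12    # cycles, from the simulated system parameters above
MISS_RATE = 0.15   # assumed L1 miss rate from Section 3.1

# average access time = L1 latency + miss rate * (L1 latency + L2 latency)
avg_access = L1_LATENCY + MISS_RATE * (L1_LATENCY + L2_LATENCY)
# 2.95 cycles, i.e., approximately 3
```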
[Figure: (a) L1D Misses, (b) L2 Misses]
most achieving the ideal outcome. Due to space limitations, we cannot provide our results here.

FDTD: The FDTD application uses six three-dimensional grids of doubles, so the 20x20x20 grid size fits in the L2 cache while the other grid sizes do not. Table 3 shows the register-based synchronization results. The gain of using hardware registers is 6-11% per iteration.

The execution time is reduced by 19-21% in the 20x20x20 case, in which the data is prepushed both ways because the data size is so small that the caches do not get polluted. In the other cases, there is only one-way prepushing, and the improvement is 26-29% in shared prepushing and 46-49% in exclusive prepushing. The improvement in exclusive prepushing is due to the reduced number of remote cache requests, as illustrated in Figures 6(a) and 6(b).

The consumer's L1D misses are reduced by 18% for 20x20x20, and by 42-48% for larger data sizes (see Figure 5(a)). Again, PUSH-S-L2 and PUSH-X-L2 provide no improvement because the consumer always misses in its L1 cache but hits in its L2 cache. The number of the consumer's L2 misses decreases by 69-73% in all the prepushing models.

In the 20x20x20 version, shared requests are reduced by 69-74%. There is an improvement of 72-74% for larger data sizes (see Figure 6(a)). As for the exclusive data requests, the 20x20x20 version and the shared prepushing models do not show any improvement. However, there is a 76% reduction in the exclusive prepushing models because there is no need to request the data in exclusive mode (see Figure 6(b)).

ARC4: Table 4 shows the register-based synchronization results. Neither stream size fits in the L2 cache. The gain of using this approach is negligible, as the producer and consumer rarely wait for each other.

The execution time is reduced by 16-17% by prepushing. Figure 5(a) shows the number of consumer L1D misses decreasing by 15-17% when the data is prepushed to the consumer's L1 cache. Again, there is no improvement in
PUSH-S-L2 and PUSH-X-L2, because the consumer cannot find the required data in its L1 cache. The number of consumer L2 misses decreases by 17% in all prepushing models. Figure 6(a) shows the consumer shared requests are reduced by 20% by prepushing. However, there is no improvement in the consumer exclusive requests (Figure 6(b)).

6. Conclusion

Current CMPs provide no explicit synchronization and communication support for multithreaded applications. The register-based synchronization and prepushing techniques proposed in this paper aim to improve that, particularly for CMPs with private caches. The register-based synchronization approach employs hardware registers to improve performance and help save power, while the prepushing approach provides an efficient communication interface where data can be moved or copied from one cache to another before it is needed at the destination. Both approaches are particularly beneficial for producer-consumer parallelism. Register-based synchronization shows promising results in reducing spin waits, improving resource utilization, and saving power. Prepushing significantly reduces cache misses and remote cache requests while eliminating the communication latency seen by the consumer processor. Our future work will focus on register-based synchronization and prepushing implementations on different architectures, as well as on further architecture optimizations.

References

[1] H. Abdel-Shafi, J. Hall, S. V. Adve, and V. S. Adve. An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors. In International Symposium on High-Performance Computer Architecture, February 1997.
[2] M. C. August, G. M. Brost, C. C. Hsiung, and A. J. Schiffleger. Cray X-MP: The Birth of a Supercomputer. IEEE Computer, 22(1):45-52, 1989.
[3] G. T. Byrd and M. J. Flynn. Producer-Consumer Communication in Distributed Shared Memory Multiprocessors. In Proceedings of the IEEE, March 1999.
[4] M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee. The M-Machine Multicomputer. In International Symposium on Microarchitecture, 1995.
[5] S. Frank, H. Burkhardt III, and J. Rothnie. The KSR1: Bridging the Gap Between Shared Memory and MPPs. In IEEE Computer Society Computer Conference, February 1993.
[6] M. Gschwind, H. P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki. Synergistic Processing in Cell's Multicore Architecture. IEEE Micro, 26(2):10-24, 2006.
[7] A. Gupta and W.-D. Weber. Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Transactions on Computers, 41(7):794-810, 1992.
[8] C. N. Keltcher, K. J. McGrath, A. Ahmed, and P. Conway. The AMD Opteron Processor for Multiprocessor Servers. IEEE Micro, 23:66-76, 2003.
[9] D. Koufaty and J. Torrellas. Comparing Data Forwarding and Prefetching for Communication-Induced Misses in Shared-Memory MPs. In International Conference on Supercomputing, 1998.
[10] D. A. Koufaty, X. Chen, D. K. Poulsen, and J. Torrellas. Data Forwarding in Scalable Shared-Memory Multiprocessors. In International Conference on Supercomputing, July 1995.
[11] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63-79, March 1992.
[12] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50-58, February 2002.
[13] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset. In SIGARCH Computer Architecture News, September 2005.
[14] E. Rosti, E. Smirni, T. D. Wagner, A. W. Apon, and L. W. Dowdy. The KSR1: Experimentation and Modeling of Poststore. In ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1993.
[15] S. L. Scott. Synchronization and Communication in the T3E Multiprocessor. In International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.
[16] D. M. Tullsen, J. L. Lo, S. J. Eggers, and H. M. Levy. Supporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor. In International Symposium on High Performance Computer Architecture, January 1999.
[17] S. Vadlamani and S. Jenks. The Synchronized Pipelined Parallelism Model. In International Conference on Parallel and Distributed Computing and Systems, November 2004.
[18] S. Vadlamani and S. Jenks. Architectural Considerations for Efficient Software Execution on Parallel Microprocessors. In International Parallel and Distributed Processing Symposium, March 2007.
[19] T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, C. Gniady, A. Ailamaki, and B. Falsafi. Store-Ordered Streaming of Shared Memory. In International Conference on Parallel Architectures and Compilation Techniques, 2005.