
Architecture Optimizations for Synchronization and Communication on Chip Multiprocessors

Sevin Fide and Stephen Jenks


Department of Electrical Engineering and Computer Science
University of California, Irvine, USA
{sevin.fide, stephen.jenks}@uci.edu

Abstract

Chip multiprocessors (CMPs) enable concurrent execution of multiple threads using several cores on a die. Current CMPs behave much like symmetric multiprocessors and do not take advantage of the proximity between cores to improve synchronization and communication between concurrent threads. Thread synchronization and communication instead use memory/cache interactions. We propose two architectural enhancements to support fine-grain synchronization and communication between threads that reduce overhead and memory/cache contention. Register-Based Synchronization exploits the proximity between cores to provide low-latency shared registers for synchronization. This approach can save significant power over spin waiting when blocking events that suspend the core are used. Prepushing provides software-controlled data forwarding between caches to reduce coherence traffic and improve cache latency and hit rates. We explore the behavior of these approaches and evaluate their effectiveness at improving synchronization and communication performance on CMPs with private caches. Our simulation results show significant reductions in inter-core traffic, latencies, and miss rates.

1. Introduction

The recent trend of increasing microprocessor performance by adding cores rather than increasing clock frequency or instruction level parallelism has resulted in powerful chip multiprocessors (CMPs). CMPs are built by incorporating several processor cores on a single die, enabling concurrent execution of multiple threads. To date, these cores have been very similar to their uniprocessor counterparts, though heterogeneous multi-core processor architectures, like the Cell [6], exist or are proposed. Most CMPs share one or more levels of the memory hierarchy among the cores, with some sharing cache levels and others sharing the memory interface. To achieve concurrent execution of multiple threads on CMPs, applications must be explicitly restructured to exploit thread level parallelism (TLP), either by the programmer or the compiler. It is difficult to fully exploit the performance offered by today's CMPs, especially when running applications with multiple threads that frequently communicate and synchronize with each other. Even though the memory hierarchy is shared among the processors, there is no explicit synchronization and communication support for multithreaded applications to take advantage of the proximity between cores.

A parallel execution approach, called the Synchronized Pipelined Parallelism Model (SPPM) [17], was introduced to reduce the demand on the memory bus by restructuring applications into producer-consumer pairs that communicate through the cache rather than the memory. The producer fetches data from memory and modifies it. While the data is still in the cache, the consumer accesses it. However, the producer and consumer need to be synchronized to prevent the producer from getting far ahead of the consumer. Therefore, SPPM enforces tight synchronization between the producer and consumer to make the producer's data available in the cache for the consumer's access. This approach reduces or eliminates the consumer's need to fetch data from memory, resulting in better performance.

Spin waits are often used to provide synchronization between multiple threads, and they may be costly if the threads are asymmetric or there are many synchronization points in the application. Several past machines employed hardware registers to facilitate synchronization among multiple processors. We believe that employing similar registers in CMPs, along with blocking instructions that can temporarily suspend a waiting core, will result in better performance and power savings than spin waits.

Memory bandwidth is a potential bottleneck for CMPs and may lead to poor performance. As each thread works independently on its own data, the CMP's shared memory interface may be overwhelmed by many simultaneous accesses. In addition, the threads may cause cache pollution, evicting each other's data from any shared cache.


Communication between multiple threads on CMPs takes place via the shared memory hierarchy, such as a shared cache or main memory. Due to demand-based data transfers and cache coherence mechanisms, the requesting thread stalls until the data is delivered. We believe a mechanism to exploit the cores' proximity and allow fast communication between cores is needed. Previous research has shown the benefits of data forwarding to reduce communication latency in distributed shared memory multiprocessors. We propose employing a similar data movement concept within the chip on CMPs with private caches that will extend the communication support offered by today's CMPs and result in reduced communication latency and miss rates.

The hardware optimizations proposed here are applicable to many types of communicating threads, but were motivated by SPPM's behavior on current CMPs, particularly those with private caches. It is shown in [18] that communicating through the cache coherence mechanism is slower than communicating through memory for some common CMPs. Using Register-Based Synchronization (RBS) will reduce the threads' spin waiting and cut cache contention and overhead. In addition, our approach of moving data as it becomes available (called prepushing) reduces the latency experienced by the consumer thread, as cache misses are turned into hits.

Similar ideas have been studied in the literature for large-scale machines, but the application of RBS and prepushing to CMPs is new. Our work also focuses on SPPM applications, which provide better memory performance on CMPs than regular parallel applications. Our implementation does not rely on directory cache coherence protocols, as has been the case in previous studies, but uses a broadcast cache coherence protocol similar to that of AMD's Opteron [8]. Furthermore, our study targets CMPs with private L1 and L2 caches, which is very different from previous studies. Finally, prepushing forwards data based on hints received from the application, and hence is software controlled to reduce complexity and provide flexibility.

Our contributions include: (1) We present significant architecture optimizations to improve the synchronization and communication support for multithreaded producer-consumer applications running on CMPs. (2) We evaluate the performance of a well-established synchronization mechanism, hardware registers, on today's CMPs. (3) We introduce a data forwarding approach, prepushing, tailored to CMPs with private caches. (4) We evaluate the performance of four different prepushing models, including shared and exclusive prepushing and prepushing to different cache levels. (5) RBS and prepushing show promising results as architecture optimizations to support synchronization and communication on CMPs.

The rest of the paper is organized as follows: Section 2 outlines prior related research. Section 3 describes our architecture optimizations to improve synchronization and communication support. Section 4 presents our benchmark applications, while Section 5 shows the results of our simulations and compares performance with and without our techniques. Finally, Section 6 summarizes our contributions and outlines future work to further improve synchronization and communication support on CMPs.

2. Related Research

Several machines used hardware registers to provide synchronization and communication among multiple processors. The Cray X-MP [2] had several clusters of shared registers for interprocess communication and synchronization. The operating system could assign a cluster to several processors to enable the use of the cluster for communication and synchronization. The Cray T3E's [15] memory interface was improved with a set of explicitly managed external registers; all remote communication and synchronization was done between these registers and memory. The M-Machine [4] also provided synchronization through registers. For register synchronization, full/empty bits were used to determine when register values were valid. This approach allowed the waiting thread to stall rather than spin. The experimental results show that register-based communication and synchronization perform better than memory-based communication and synchronization.

A hardware-based synchronization approach for simultaneous multithreaded (SMT) processors is presented in [16]. The presented approach, blocking acquire and release, is a hardware implementation of traditional software synchronization approaches. Synchronization through full/empty registers is also presented. The success of prior work on hardware-based synchronization leads us to employ full/empty registers to extend the synchronization support offered by today's CMPs.

Data forwarding has been studied on several platforms. The Stanford Dash Multiprocessor [11] provides operations that allow the producer to send data directly to consumers: the update-write operation sends data to all processors that have the data in their caches, while the deliver operation sends the data to specified processors. The KSR1 [5] provides programmers with a poststore operation. When a variable is updated, using poststore causes a copy of the variable to be sent to all caches that contain a copy of that variable. An experimental study of poststore is presented in [14].

A compiler algorithm framework for data forwarding is presented in [10]. A Write-and-Forward assembly instruction is inserted by the compiler to replace ordinary write instructions. The same data forwarding approach and prefetching are compared in [9]. Similarly, another study shows that prefetching is insufficient for producer-consumer sharing patterns and migratory sharing patterns in [1].

A range of consumer- and producer-initiated mechanisms and their performance on a set of benchmarks is studied in [3]. Store-ORDered Streams (SORDS) [19] is a memory streaming approach to send data from producers to consumers in distributed shared memory multiprocessors. This approach is based on producer-consumer temporal address correlation, which claims that shared values are consumed in approximately the same order that they were produced.

Although all these studies on data forwarding fall into the same category as ours, the implementation techniques and target areas are different. First, our implementation does not rely on directory cache coherence protocols, but uses a broadcast cache coherence protocol similar to AMD Opteron's. Second, our study is not tailored to distributed shared memory multiprocessors, but targets CMPs with private L1 and L2 caches; data forwarding on CMPs is new because CMPs themselves are new. Third, our approach forwards data based on hints received from the application, and hence is software controlled to reduce complexity and provide flexibility.

3. CMP Architecture Optimizations

3.1. Register-Based Synchronization (RBS)

In earlier systems, hardware registers were used to provide synchronization and communication among processors. The success of this prior work led us to employ full/empty registers to extend the synchronization support offered by today's CMPs. Even though the concept of using hardware registers for synchronization and communication is not new, it has not been applied to CMPs before, yet is sorely needed. As discussed in Section 2, research shows that RBS performs better than memory-based communication and synchronization.

In SPPM applications, the consumer is usually faster than the producer because it finds its data in the cache rather than the memory. Thus, when the consumer is done, it has to spin wait until the producer updates the next data block and grants access. Figure 1 shows our RBS scheme for CMPs. The shared register has a full/empty status bit. When the producer reaches the synchronization point, it sets the register status to full, which wakes the consumer up. When the consumer is done, instead of spin waiting and consuming system resources, it goes into idle mode to save power. Upon a register status change, the consumer wakes up and the process repeats. We consider the synchronization registers to be shared between cores.

Figure 1. RBS for SPPM Applications

To evaluate RBS, we modified the GEMS [13] Ruby memory model, which is based on the Simics [12] full system simulation platform. Instead of adding hardware registers to the system, we used memory mapped locations to track accesses to the synchronization variables and evaluated the gain of using hardware registers for SPPM applications on CMPs. To dedicate a set of memory locations for the evaluation of hardware registers, we wrote a device driver and mapped it to memory; we then accessed the device as if we were accessing the hardware registers. The following pseudo-code shows a simple consumer implementation in SPPM applications. The data is partitioned into blocks, and a data block is read in each loop iteration. Before reading a data block, the consumer has to check whether the synchronization window has been violated. The synchronization window is the interval between the producer's current position and the consumer's current position in the shared data array.

    for i = 0 to N do
        while synchronization window for dataBlock[i] is violated do
            wait
        end while
        read dataBlock[i]
        generate results
    end for

The condition that checks for a synchronization window violation includes accessing the synchronization variables. Since the data block size consumed per loop iteration may exceed the L1 cache size, the synchronization variables will be evicted from the L1 cache under the LRU eviction policy. Therefore, we assume an optimal miss rate of 15% and estimate the average access time to the synchronization variables to be 3 cycles on a CMP system with an L1 cache of 1-cycle latency and an L2 cache of 12-cycle latency, as shown below:

    average access time = L1 latency + miss rate * (L1 latency + L2 latency)
                        = 1 + 0.15 * (1 + 12)
                        ≈ 3

The cost of implementing RBS is the addition of shared hardware registers with full/empty bits to the system. The latency of accessing those registers would be the same as accessing any register on the system.
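To make the intended software interface concrete, the sketch below shows how the consumer loop above could block on such a register instead of spin waiting. It is only an illustration: the reg_wait_full(), reg_set_full(), and reg_set_empty() primitives, and the simplified one-register handoff (rather than the three-register window used in our implementation), are assumptions for the example, not an existing instruction set.

    /* Hypothetical blocking primitives for a shared full/empty register.
       A real implementation would expose these as instructions that can
       suspend the waiting core until the register status changes. */
    void reg_wait_full(int reg);    /* block (core idles) until reg is full */
    void reg_set_full(int reg);     /* mark reg full, waking any waiter     */
    void reg_set_empty(int reg);    /* mark reg empty, waking any waiter    */

    void read_data_block(int i);    /* application-specific work            */
    void generate_results(int i);

    #define SYNC_REG 0

    /* Consumer side of the loop shown above, one data block per handoff. */
    void consumer(int n_blocks)
    {
        for (int i = 0; i < n_blocks; i++) {
            reg_wait_full(SYNC_REG);    /* wait for producer to publish block i */
            read_data_block(i);         /* data is still warm in the cache      */
            generate_results(i);
            reg_set_empty(SYNC_REG);    /* let the producer move on             */
        }
    }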

In SPPM applications, the number of hardware registers needed to replace the synchronization variables is three: two for the producer and consumer positions, and one for the flag that resets the positions when both threads are done at the end of each iteration.

3.2. Data Communications via Prepushing

Data forwarding has been shown to reduce communication latency and miss rates, as discussed in Section 2. The previous studies mostly involve distributed shared memory multiprocessors and directory cache coherence protocols. In this paper, we present prepushing to facilitate data transfers between a single producer-consumer pair. Our work brings the data forwarding idea (i.e., sending data before it is requested) to CMPs with private L1 and L2 caches, and studies the effects of several prepushing models to improve the performance of producer-consumer applications.

Figure 2 illustrates the execution behavior of a single producer-consumer pair, comparing the conventional approach to prepushing. Normally, data is pulled by the consumer rather than pushed by the producer. Each data block consists of several cache lines. As the consumer accesses a data block, it issues cache line requests and must wait or stall while each line is retrieved from the remote cache. If each cache line is consumed in less time than it takes to get the next one, prefetching will not be fully effective at masking the remote fetch latency. Furthermore, prior prefetched cache lines may still be in use by the producer, so a prefetcher may add to the coherence traffic overhead rather than reduce it. Prepushing, on the other hand, allows the producer to send the data as soon as it is done with it, so the consumer receives the data by the time it is needed. In the conventional approach, two operations per cache line are seen on the bus, while only one operation per cache line is seen with prepushing. The prepushing approach reduces the number of data requests, the number of misses, and the communication latency seen by the consumer.

Figure 2. Execution Behaviors of Conventional Approach and Prepushing

We do not consider process/thread migration due to a context switch, because SPPM applications assign each thread to a specific processor. This is done to prevent necessary data from being evicted from the cache. Our simulations take place on a Solaris 9 platform, so we use the processor_bind function to assign each thread to a specific processor. To evaluate the prepushing approach, we extended the GEMS Ruby memory model. We added a prepusher as a hardware component integrated with the cache controller, and experimented with the MOESI broadcast cache coherence protocol. There is a prepush queue that contains information about data that will be forwarded from the producer to the consumer. In addition, new events, actions, and state transitions are added to the cache controller to describe the prepushing behavior. In general, the cost of adding a prepusher to the system is relatively small, as it would be implemented by extending the cache coherence protocol.

The prepushing approach is well suited to SPPM applications with tightly synchronized producer-consumer pairs. Software hints are inserted into the SPPM applications, and hence the prepushing approach is software controlled. When the producer is done with a block of data, it signals the prepusher to initiate data movement by providing information such as the data block address, the data block size, and the destination. The prepusher determines which cache lines comprise the data block and calculates the physical address of each cache line. Then, messages containing information about the cache lines are placed into the prepush queue. The cache controller reads the prepush queue and handles the data transfers in shared or exclusive mode, based on the prepush request type. The data is prepushed to the L1 cache or the L2 cache of the consumer, depending on the prepushing model. On the consumer side, if a tag exists for a particular prepushed cache line, it is directly written to the existing location. Otherwise, it is written to a newly allocated location, which may require a cache line replacement.
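To illustrate how such a hint might look in application code, the following sketch shows the producer signaling the prepusher after finishing each block. The prepush_hint() call and its mode constants are hypothetical placeholders for whatever instruction or memory-mapped command a real prepusher would expose; they are not part of an existing API.

    #include <stddef.h>

    enum prepush_mode { PREPUSH_SHARED, PREPUSH_EXCLUSIVE };

    /* Hypothetical software hint to the prepusher: forward 'bytes' bytes
       starting at 'addr' to the cache of core 'dest' in the given mode. */
    void prepush_hint(const void *addr, size_t bytes, int dest,
                      enum prepush_mode mode);

    void compute_block(double *block, size_t elems);   /* producer's work */

    /* Producer loop: finish a block, then ask the prepusher to forward it. */
    void producer(double *data, size_t block_elems, size_t n_blocks,
                  int consumer_core)
    {
        for (size_t i = 0; i < n_blocks; i++) {
            double *block = data + i * block_elems;
            compute_block(block, block_elems);
            prepush_hint(block, block_elems * sizeof(double),
                         consumer_core, PREPUSH_SHARED);
        }
    }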

We explored four different prepushing models:

PUSH-S-L1: The prepush policy is to send each cache line to the consumer in shared state as soon as it is produced, while the producer's cache line assumes the owned state. This sharing behavior allows the producer to reuse the data if necessary. The placement policy puts the prepushed cache line into the L1 cache, which may include evicting cache lines from the L1 to the L2 cache.

PUSH-X-L1: The prepush policy is to send each cache line to the consumer in exclusive state as soon as the producer no longer needs it, thus invalidating it in the producer's cache. This behavior is similar to migratory sharing [7]. It improves performance when the consumer needs to write to the cache line, as it eliminates the invalidate request to make the cache line exclusive. The placement policy is the same as PUSH-S-L1.

PUSH-S-L2: The prepush policy is the same as PUSH-S-L1. The placement policy is to write the prepushed cache line to the L1 cache if it is not full; otherwise, the cache line is written to the L2 cache, evicting cache lines if needed.

PUSH-X-L2: The prepush policy is the same as PUSH-X-L1, while the placement policy is the same as PUSH-S-L2.

The implementation must also handle the race condition where the consumer issues an explicit cache line request before the prepushing of that particular cache line takes place. After the prepusher calculates the cache line address, it checks the global request table to see if the corresponding cache line has been explicitly requested. If so, prepushing of that cache line is canceled. If this condition is not handled, the cache line will be sent to the remote cache, incurring extra network traffic.

4. Benchmarks

We used three hand-coded producer-consumer applications as benchmarks for our research. Temporal parallelism is exploited by running the producer and consumer concurrently. The producer fetches the data it needs from memory, generates its results, and updates the data in the cache. While the data is still in the cache, it is fetched and used by the consumer. The producer and consumer need to be synchronized to prevent the producer from getting far ahead of the consumer. Therefore, there is a synchronization window between the producer and consumer to make the producer's data available in the cache for the consumer's access.

Red Black (RB) Solver: The RB solver solves a partial differential equation using a finite differencing method. It uses a two-dimensional grid, where each point is updated by combining its present value with those of its four neighbors. To avoid dependences, alternate points are labeled red or black, and only one type is updated at a time. The values are updated until convergence is reached. The application is restructured into a producer-consumer pair where the producer does the red computation while the consumer does the black.

Figure 3 illustrates the RB solver grid structure and the producer's and consumer's positions in the shared and exclusive prepushing models. For each point in the current row, the previous and next rows are needed. In the exclusive prepushing models, the synchronization window between the producer and consumer is larger. The producer cannot prepush the row it has just finished processing (i.e., the current row), because exclusive prepushing would invalidate the data needed for the next row, so the producer has to prepush the previous row. This larger synchronization window does not degrade performance; in fact, it reduces the consumer's remote cache requests, as discussed in Section 5.

Figure 3. Red Black Solver Synchronization Window

Finite Difference Time Domain (FDTD): FDTD is an extremely memory-intensive electromagnetic simulation. It uses six three-dimensional arrays, three of which constitute the electric field while the other three constitute the magnetic field. During each time step, the magnetic field is updated using values of the electric field from the previous time step. Then, the electric field is updated using values of the magnetic field computed in the current time step. The application is restructured into a producer-consumer pair where the producer does the magnetic field update while the consumer does the electric field update.

ARC4 Stream Cipher (ARC4): ARC4 is the Alleged RC4, a stream cipher commonly used in protocols such as the Secure Sockets Layer (SSL) and Wired Equivalent Privacy (WEP) in wireless networks. The encryption process uses a pseudo-random number generator to generate a keystream of pseudo-random bits. As each byte of the keystream is generated, it is XORed with the corresponding byte of the plaintext to generate a byte of the ciphertext. Because the generation of each byte of the keystream is dependent on the previous internal state, the process is inherently sequential and, thus, non-parallelizable using conventional data parallelism models. By treating the keystream generation and the production of ciphertext as producer and consumer, we exploit the inherent concurrency.
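As a rough sketch of how this producer-consumer split maps onto code (our illustration only; key scheduling, the buffering of keystream blocks between the threads, and the SPPM synchronization window are omitted), the producer runs the standard RC4 pseudo-random generation loop while the consumer XORs the resulting keystream with the plaintext:

    #include <stdint.h>
    #include <stddef.h>

    /* RC4 state: permutation S and indices i, j (key scheduling not shown). */
    typedef struct { uint8_t S[256]; uint8_t i, j; } rc4_state;

    /* Producer: generate a block of keystream bytes (standard RC4 PRGA). */
    void keystream_block(rc4_state *st, uint8_t *ks, size_t n)
    {
        for (size_t k = 0; k < n; k++) {
            st->i = (uint8_t)(st->i + 1);
            st->j = (uint8_t)(st->j + st->S[st->i]);
            uint8_t t = st->S[st->i];
            st->S[st->i] = st->S[st->j];
            st->S[st->j] = t;
            ks[k] = st->S[(uint8_t)(st->S[st->i] + st->S[st->j])];
        }
    }

    /* Consumer: XOR the keystream block (still warm in the cache) with the
       plaintext to produce the ciphertext. */
    void encrypt_block(const uint8_t *ks, const uint8_t *pt, uint8_t *ct, size_t n)
    {
        for (size_t k = 0; k < n; k++)
            ct[k] = pt[k] ^ ks[k];
    }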

In our simulations, we used unmodified SPPM producer-consumer code as our baseline. Since SPPM performs better than prefetching in these applications, we compare against SPPM. We added software hints to the applications to command the prepusher to perform data transfers. These instructions include the address of the data to be prepushed, its size, the destination, and the prepush mode (either shared or exclusive). We use the GEMS Ruby memory model as our simulation platform; the GEMS Ruby simulator uses an in-order processor model. The simulated system is a dual-processor SPARC machine with the characteristics shown in Table 1. The prepush latency used in our simulations includes the latency to fetch the data from the L1 cache and the network latency to transfer the data to the destination's cache. The system did not use a shared L2 cache because we modeled it after AMD's dual-core machines and used the MOESI broadcast cache coherence protocol like the Opteron.

    CPU               2 GHz UltraSPARC III+ processors
    L1 Cache          64 KB, 2-way associative, 1 cycle
    L2 Cache          1 MB, 16-way associative, 12 cycles
    Cache Line Size   64 B
    Memory            4 GB, 120 cycles
    Operating System  Solaris 9

Table 1. Simulation Environment

5. Performance Results

Red Black (RB) Solver: The RB solver uses a two-dimensional grid of doubles, so a grid size of 200x200 fits in the L2 cache while 400x400 does not. Table 2 shows the total execution time taken by a producer-consumer pair in each iteration, the number of synchronization point accesses, and the estimated time taken for them. The average time to access the synchronization variables is 3 cycles (see Section 3). Multiplying the average access time by the number of accesses per iteration gives the estimated access time per iteration, and the ratio of the estimated access time per iteration to the execution time per iteration gives the gain from RBS. The results show that the gain of using hardware registers for synchronization is 2-5% per iteration (for example, 40,593 / 2,056,725 ≈ 2.0% for the 200x200 grid).

    Grid Size                      200x200      400x400
    Exec. Time per Iter. (cycles)  2,056,725    13,905,891
    Access Count per Iter.         13,531       211,933
    Est. Access Time per Iter.     40,593       635,799

Table 2. Red Black - RBS Results

All the results shown in our graphs are normalized to SPPM. Figure 4 shows the total execution time for all three benchmarks. In the RB solver, there is a 20-39% improvement in shared prepushing and a 40-62% improvement in exclusive prepushing. The shared prepushing models take slightly more time due to the higher number of exclusive data requests illustrated in Figure 6(b): the consumer needs to receive the data exclusively, but the shared prepushing models leave the data in shared state, so the consumer has to make explicit exclusive data requests.

Figure 4. Normalized Execution Time

Figure 5(a) illustrates the consumer's L1D cache misses for all three benchmarks. In the RB solver, the number of consumer L1D misses decreases by 47-48% when the data is prepushed to the consumer's L1 cache. There is no improvement in PUSH-S-L2 and PUSH-X-L2 because the consumer cannot find the required data in its L1 cache and has to retrieve it from its L2 cache; this is still much better than retrieving the data from either the remote cache or memory. The number of consumer L2 misses decreases by 93-98% in all prepushing models, as shown in Figure 5(b).

Figure 5. Consumer's Normalized Cache Misses: (a) L1D Misses, (b) L2 Misses

Figures 6(a) and 6(b) show the consumer's shared and exclusive requests, respectively, for all three benchmarks. In the RB solver, the shared requests are reduced by 93-98%; since the consumer receives the data beforehand, the number of requests is significantly reduced. As expected, the number of exclusive requests is much lower in PUSH-X-L1 and PUSH-X-L2, due to invalidations of the producer's data. The exclusive requests stay the same in the shared prepushing models because the consumer still has to issue invalidation requests, which causes the slight increase in execution time in the shared prepushing models.

Figure 6. Consumer's Normalized Remote Cache Requests: (a) Shared Data Requests, (b) Exclusive Data Requests

To avoid the race condition discussed in Section 3, the prepusher checks whether the consumer has explicitly requested a cache line before pushing it to the consumer. If so, the prepusher cancels the prepushing of that particular cache line. Our simulation results show that the prepusher works very effectively in all benchmarks, almost achieving the ideal outcome. Due to space limitations, we cannot provide our results here.
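In hardware this check is part of the cache controller; purely to illustrate the control flow, a sketch of the cancellation logic is shown below. The types and helper names are assumptions for the example, not the simulator's actual interfaces.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical prepush queue entry and helpers. */
    typedef struct { uint64_t line_addr; int dest; int exclusive; } prepush_msg;

    bool prepush_queue_pop(prepush_msg *out);           /* next queued message      */
    bool global_request_table_has(uint64_t line_addr);  /* consumer already asked?  */
    void forward_line(uint64_t line_addr, int dest, int exclusive);

    /* Drain the prepush queue, canceling any line the consumer has already
       requested explicitly (Section 3.2), to avoid duplicate transfers. */
    void prepusher_drain(void)
    {
        prepush_msg m;
        while (prepush_queue_pop(&m)) {
            if (global_request_table_has(m.line_addr))
                continue;                       /* cancel: request already in flight */
            forward_line(m.line_addr, m.dest, m.exclusive);
        }
    }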

FDTD: The FDTD application uses six three-dimensional grids of doubles, so the 20x20x20 grid fits in the L2 cache while the other grid sizes do not. Table 3 shows the register-based synchronization results; the gain of using hardware registers is 6-11% per iteration.

The execution time is reduced by 19-21% in the 20x20x20 case, in which the data is prepushed both ways because the data size is so small that the caches do not get polluted. In the other cases, there is only one-way prepushing, and the improvement is 26-29% in shared prepushing and 46-49% in exclusive prepushing. The improvement in exclusive prepushing is due to the reduced number of remote cache requests, as illustrated in Figures 6(a) and 6(b). The consumer's L1D misses are reduced by 18% for 20x20x20 and by 42-48% for the larger data sizes (see Figure 5(a)). Again, PUSH-S-L2 and PUSH-X-L2 provide no improvement because the consumer always misses its L1 cache but hits in its L2 cache. The number of consumer L2 misses decreases by 69-73% in all the prepushing models.

In the 20x20x20 version, shared requests are reduced by 69-74%; there is an improvement of 72-74% for the larger data sizes (see Figure 6(a)). As for the exclusive data requests, the 20x20x20 version and the shared prepushing models do not show any improvement. However, there is a 76% reduction in the exclusive prepushing models because there is no need to request the data in exclusive mode (see Figure 6(b)).

ARC4: Table 4 shows the register-based synchronization results. Neither stream size fits in the L2 cache. The gain of using this approach is negligible, as the producer and consumer rarely wait for each other.

The execution time is reduced by 16-17% by prepushing. Figure 5(a) shows the number of consumer L1D misses decreasing by 15-17% when the data is prepushed to the consumer's L1 cache. Again, there is no improvement in PUSH-S-L2 and PUSH-X-L2 because the consumer cannot find the required data in its L1 cache. The number of consumer L2 misses decreases by 17% in all prepushing models. Figure 6(a) shows the consumer's shared requests are reduced by 20% by prepushing. However, there is no improvement in the consumer's exclusive requests (Figure 6(b)).

    Grid Size                      30x30x30     40x40x40
    Exec. Time per Iter. (cycles)  5,741,480    8,124,822
    Access Count per Iter.         120,425      291,506
    Est. Access Time per Iter.     361,275      874,518

Table 3. FDTD - RBS Results

    Stream Size                    10 MB        50 MB
    Exec. Time per Iter. (cycles)  1,327,413    1,330,287
    Access Count per Iter.         56           72
    Est. Access Time per Iter.     168          216

Table 4. ARC4 - RBS Results
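For example, applying the same ratio of estimated access time to execution time used for the RB solver to the numbers in Tables 3 and 4 gives:

    FDTD, 30x30x30:  361,275 / 5,741,480  ≈ 6.3%
    FDTD, 40x40x40:  874,518 / 8,124,822  ≈ 10.8%
    ARC4, 10 MB:     168 / 1,327,413      ≈ 0.01%
    ARC4, 50 MB:     216 / 1,330,287      ≈ 0.02%

which is consistent with the 6-11% gain reported for FDTD and the negligible gain reported for ARC4.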

6. Conclusion

Current CMPs provide no explicit synchronization and communication support for multithreaded applications. The register-based synchronization and prepushing techniques proposed in this paper aim to improve that, particularly for CMPs with private caches. The register-based synchronization approach employs hardware registers to improve performance and save power, while the prepushing approach provides an efficient communication interface where data can be moved or copied from one cache to another before it is needed at the destination. Both approaches are particularly beneficial for producer-consumer parallelism. Register-based synchronization shows promising results in reducing spin waits, improving resource utilization, and saving power. Prepushing significantly reduces cache misses and remote cache requests while eliminating the latency seen by the consumer processor. Our future work will focus on register-based synchronization and prepushing implementations on different architectures, as well as on further architecture optimizations.

References

[1] H. Abdel-Shafi, J. Hall, S. V. Adve, and V. S. Adve. An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors. In International Symposium on High-Performance Computer Architecture, February 1997.
[2] M. C. August, G. M. Brost, C. C. Hsiung, and A. J. Schiffleger. Cray X-MP: The Birth of a Supercomputer. IEEE Computer, 22(1):45–52, 1989.
[3] G. T. Byrd and M. J. Flynn. Producer-Consumer Communication in Distributed Shared Memory Multiprocessors. In Proceedings of the IEEE, March 1999.
[4] M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee. The M-Machine Multicomputer. In International Symposium on Microarchitecture, 1995.
[5] S. Frank, H. B. III, and J. Rothnie. The KSR1: Bridging the Gap Between Shared Memory and MPPs. In IEEE Computer Society Computer Conference, February 1993.
[6] M. Gschwind, H. P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki. Synergistic Processing in Cell's Multicore Architecture. IEEE Micro, 26(2):10–24, 2006.
[7] A. Gupta and W.-D. Weber. Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Transactions on Computers, 41(7):794–810, 1992.
[8] C. N. Keltcher, K. J. McGrath, A. Ahmed, and P. Conway. The AMD Opteron Processor for Multiprocessor Servers. IEEE Micro, 23:66–76, 2003.
[9] D. Koufaty and J. Torrellas. Comparing Data Forwarding and Prefetching for Communication-Induced Misses in Shared-Memory MPs. In International Conference on Supercomputing, 1998.
[10] D. A. Koufaty, X. Chen, D. K. Poulsen, and J. Torrellas. Data Forwarding in Scalable Shared-Memory Multiprocessors. In International Conference on Supercomputing, July 1995.
[11] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63–79, March 1992.
[12] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50–58, February 2002.
[13] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset. In SIGARCH Computer Architecture News, September 2005.
[14] E. Rosti, E. Smirni, T. D. Wagner, A. W. Apon, and L. W. Dowdy. The KSR1: Experimentation and Modeling of Poststore. In ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1993.
[15] S. L. Scott. Synchronization and Communication in the T3E Multiprocessor. In International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.
[16] D. M. Tullsen, J. L. Lo, S. J. Eggers, and H. M. Levy. Supporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor. In International Symposium on High Performance Computer Architecture, January 1999.
[17] S. Vadlamani and S. Jenks. The Synchronized Pipelined Parallelism Model. In International Conference on Parallel and Distributed Computing and Systems, November 2004.
[18] S. Vadlamani and S. Jenks. Architectural Considerations for Efficient Software Execution on Parallel Microprocessors. In International Parallel and Distributed Processing Symposium, March 2007.
[19] T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, C. Gniady, A. Ailamaki, and B. Falsafi. Store-Ordered Streaming of Shared Memory. In International Conference on Parallel Architectures and Compilation Techniques, 2005.

