

Different Software models to exploit thread level parallelism

Parallel processing: Execution of a tightly coupled set of threads collaborating on a single task.
Request Level Parallelism: Execution of multiple, relatively independent processes that may originate from one or
more users
Multithreading: A technique that supports multiple threads executing in an interleaved fashion on a single multiple-issue processor.
Clusters: Ultrascale computers built from a very large number of processors connected with networking technology.
When these clusters grow to tens of thousands of servers and beyond, we call them warehouse-scale computers.
Multicomputers: Large-scale multiprocessor systems that are less tightly coupled than typical multiprocessors but more tightly coupled than warehouse-scale systems.
Grain Size: Amount of computation assigned to a thread.
Multiprocessors are classified based on their memory organization:
Symmetric (shared-memory) multiprocessors, SMPs, or centralized shared-memory multiprocessors: Share a single
centralized memory that all processors have access to, hence symmetric. The SMP architecture is also called
Uniform Memory Access (UMA) multiprocessing, since all processors have a uniform latency from memory.
Distributed Shared Memory Multiprocessor or Non-Uniform Memory Access Multiprocessor (NUMA).
Challenges of Parallel Processing
Limited Parallelism:

Large latency of remote access in a parallel processor: 35 to 50 clock cycles among cores on the same chip; 100 to
500 clock cycles among cores on separate chips.

Coherence vs. Consistency: Coherence defines memory access behaviour for the same memory location; consistency
defines memory access behaviour across different locations. A coherent memory system preserves the order among
accesses to the same location by different processors, whereas a consistent memory system preserves the order
between accesses to different locations issued by a given processor.
Write Serialization: Two writes to the same location by any two processors are seen in the same order by all processors.
Sequential Consistency: If the results of any execution of a program are such that it is possible to construct a
hypothetical serial order of all operations to the memory (i.e., all locations) that is consistent with the results of the
actual execution, then the execution is sequentially consistent.
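The definition above can be checked mechanically for small programs. The following sketch enumerates every serial interleaving of the classic two-thread litmus test (each thread writes its own flag, then reads the other's); the thread names and operation encoding are illustrative. Under sequential consistency, the outcome where both reads return 0 can never be explained by any serial order:

```python
from itertools import chain

# Thread A: x = 1; r1 = y        Thread B: y = 1; r2 = x
A = [('w', 'x'), ('r', 'y')]
B = [('w', 'y'), ('r', 'x')]

def interleavings(a, b):
    """All serial orders that preserve each thread's program order."""
    if not a:
        yield b; return
    if not b:
        yield a; return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

sc_outcomes = set()
for order in interleavings(A, B):
    mem = {'x': 0, 'y': 0}
    reads = {}
    for op, var in order:
        if op == 'w':
            mem[var] = 1
        else:
            reads[var] = mem[var]
    sc_outcomes.add((reads['y'], reads['x']))   # (r1, r2)

print(sorted(sc_outcomes))   # (0, 0) never appears
```

Real hardware with store buffers can produce (0, 0), which is exactly why it is not sequentially consistent.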
Cache Coherence Schemes:
Directory based: Uses centralized information to avoid broadcast. The sharing status of a particular block of
physical memory is kept in one centralized location, called the directory. It scales well to a large number of processors.

Snooping: Relies on broadcast to observe all coherence traffic. Every cache that has a copy of the data from a block
of physical memory can track the sharing status of the block. Well suited to buses and small-scale systems.
MSI Protocol:
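A minimal sketch of MSI over a snooping bus, tracking one block per cache. The class and method names are illustrative, not from any real implementation: a read miss downgrades any Modified copy to Shared (after writeback), and a write invalidates all other copies before entering Modified.

```python
class MSICache:
    """One cache's state ('I', 'S', or 'M') for a single tracked block."""
    def __init__(self, bus):
        self.state = 'I'
        bus.caches.append(self)

class Bus:
    def __init__(self):
        self.caches = []

    def read(self, requester):
        if requester.state == 'I':                 # read miss
            for c in self.caches:                  # snoop: M copy writes back
                if c is not requester and c.state == 'M':
                    c.state = 'S'                  # downgrade to Shared
            requester.state = 'S'

    def write(self, requester):
        if requester.state != 'M':                 # need exclusive ownership
            for c in self.caches:                  # invalidate other copies
                if c is not requester:
                    c.state = 'I'
            requester.state = 'M'

bus = Bus()
p0, p1 = MSICache(bus), MSICache(bus)
bus.read(p0); bus.read(p1)      # both caches end up Shared
bus.write(p1)                   # p1 -> Modified, p0 invalidated
print(p0.state, p1.state)       # I M
```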

MESI: Adds the Exclusive state to the basic MSI protocol to indicate when a cache block is resident only in a single
cache but is clean. A block in the E state can be written without generating any invalidates, which optimizes the case
where a block is read by a single cache before being written by that same cache. Subsequent writes to a block in the E
state by the same core need not acquire bus access or generate an invalidate, since the block is known to be
exclusively in this cache; the processor merely changes the state to MODIFIED. A read miss by another cache to a
block in the E state causes a state change to SHARED to maintain coherence.
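The E-state optimization can be sketched as follows; the function names and the invalidate counter are illustrative. A read miss that finds no other copy installs the block Exclusive, so the later write upgrades E to M silently, with zero invalidates on the bus:

```python
class MESICache:
    def __init__(self):
        self.state = 'I'   # 'M', 'E', 'S', or 'I'

def read(requester, caches):
    others = [c for c in caches if c is not requester and c.state != 'I']
    for c in others:
        if c.state in ('M', 'E'):
            c.state = 'S'                  # downgrade on a remote read
    requester.state = 'S' if others else 'E'   # E: sole clean copy

def write(requester, caches, invalidates):
    if requester.state in ('E', 'M'):
        requester.state = 'M'              # silent upgrade, no bus traffic
        return invalidates
    for c in caches:                       # otherwise broadcast invalidate
        if c is not requester:
            c.state = 'I'
    requester.state = 'M'
    return invalidates + 1

p0, p1 = MESICache(), MESICache()
inv = 0
read(p0, [p0, p1])              # p0 -> E (no other sharer)
inv = write(p0, [p0, p1], inv)  # E -> M silently
print(p0.state, inv)            # M 0
```

Under plain MSI the same sequence would install the block Shared and the write would have to broadcast an invalidate.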
MOESI: Adds the Owned state to the MESI protocol to indicate that the associated block is owned by that cache and
out of date in memory. In the MSI and MESI protocols, when there is an attempt to share a block in the Modified state,
the state is changed to Shared (in both the original and the newly sharing cache) and the block must be written back to
memory. In the MOESI protocol, however, the block can be changed from MODIFIED to OWNED in the original cache
without writing it to memory. The newly sharing cache keeps the block in the SHARED state. On a miss, the owner
of the block must supply the data, since the memory copy is stale, and must write the block back to memory if
it is replaced.
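The M-to-O transition above can be sketched like this; the names and the writeback counter are illustrative. A remote read of a Modified block moves the owner to Owned and supplies the data cache-to-cache, and memory is updated only when the Owned block is finally evicted:

```python
writebacks = 0

class Cache:
    def __init__(self):
        self.state = 'I'   # 'M', 'O', 'E', 'S', or 'I'

def remote_read(owner, reader):
    if owner.state == 'M':
        owner.state = 'O'      # keep ownership; memory stays stale
    reader.state = 'S'         # data supplied cache-to-cache, no writeback

def evict(cache):
    global writebacks
    if cache.state in ('M', 'O'):
        writebacks += 1        # only now is memory updated
    cache.state = 'I'

p0, p1 = Cache(), Cache()
p0.state = 'M'                 # p0 holds dirty data
remote_read(p0, p1)            # M -> O, no memory write yet
print(p0.state, writebacks)    # O 0
evict(p0)                      # owner replaced: write back now
print(writebacks)              # 1
```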

As the number of processors grows, or as the memory demands of each core grow, the centralized resource in the
system (main memory or L3 cache) can become a bottleneck.
Snooping bandwidth at the caches is also a problem, since every cache must examine every miss placed on the bus.
FALSE SHARING MISSES: Occur when a block is invalidated because some word in the block, other than the one
being read, is written into. If the word written into is actually used by the processor that received the invalidate,
then the reference was a true sharing reference and would have caused a miss independent of the block size;
otherwise it is a false sharing miss.
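The classification rule can be sketched as a tiny simulator; the setup (a two-word block, one writer, one reader) is illustrative. The miss is true sharing only when the reader actually uses the word that was written:

```python
BLOCK_WORDS = 2   # assume two words share one cache block

def classify_miss(written_word, read_word):
    """Classify the reader's coherence miss after an invalidate.

    Both word offsets lie in the same block by construction, so the
    invalidate happens either way; only its necessity differs."""
    assert 0 <= written_word < BLOCK_WORDS and 0 <= read_word < BLOCK_WORDS
    return 'true' if written_word == read_word else 'false'

# P1 writes word 0; P2 then reads word 1 of the same block:
print(classify_miss(0, 1))   # false  (invalidate hit a word P2 never used)
# P1 writes word 0; P2 then reads word 0:
print(classify_miss(0, 0))   # true   (miss would occur at any block size)
```

The practical fix for false sharing is data layout: pad or align per-thread data so it does not share a cache block.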
Directory Based Cache Coherence Protocol

In addition to tracking the state of each potentially shared memory block, we must also track which nodes have
copies of that block, since those copies need to be invalidated on a write.
The local node is the node where a request originates. The home node is the node where the memory location and
the directory entry reside. The physical address space is statically distributed, so the node that contains the
memory and directory for a given physical address is known. A remote node is the node that has a copy of the
cache block, whether exclusive or shared.
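A directory entry can be sketched as a state plus a sharer set per block, held at the home node; the class and state names are illustrative. A write request from the local node makes the home node send point-to-point invalidates only to the remote nodes in the sharer set, avoiding broadcast:

```python
class DirectoryEntry:
    """Per-block entry kept at the block's home node."""
    def __init__(self):
        self.state = 'Uncached'   # Uncached / Shared / Modified
        self.sharers = set()      # ids of nodes holding a copy

    def read(self, node):
        self.sharers.add(node)
        if self.state == 'Uncached':
            self.state = 'Shared'

    def write(self, node):
        # Invalidate every other sharer, point-to-point (no broadcast).
        invalidates = sorted(self.sharers - {node})
        self.sharers = {node}
        self.state = 'Modified'
        return invalidates

entry = DirectoryEntry()          # lives at the block's home node
entry.read(0); entry.read(1); entry.read(2)   # three nodes cache the block
print(entry.write(1))             # [0, 2]  invalidates go only to other sharers
print(entry.state)                # Modified
```

Because the address space is statically distributed, the local node can compute the home node for any physical address and send its request directly there.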