Echelon Institute of Technology
Unit 5: Concurrent Processors

Concurrent Processors
1. Vector Processors
2. Vector Memory
3. Multiple Issue Machines
4. Comparing Vector and Multiple Issue Processors

Shared Memory Multiprocessors
1. Basic Issues: Partitioning, Synchronization and Coherency
2. Types of Shared Memory Multiprocessors
3. Memory Coherence in Shared Memory Multiprocessors

Shared Memory Multiprocessors

Multiprocessors are usually designed for at least one of two reasons: fault tolerance or program speed-up.

Fault-Tolerant Systems: n identical processors ensure that the failure of one processor does not affect the ability of the multiprocessor to continue with program execution. These multiprocessors are called high-availability or high-integrity systems. They may not provide any speed-up over a single-processor system.

Program Speed-up: most multiprocessors are designed with the main objective of improving program speed over that of a single processor. Fault tolerance is still an issue, however, since no design for speed-up ought to come at the expense of fault tolerance: it is generally not acceptable for the whole multiprocessor system to fail if any one of its processors fails.

Basic Issues

Three basic issues are associated with the design of multiprocessor systems:
1. Partitioning
2. Scheduling of tasks
3. Communication and synchronization

Partitioning

Partitioning is the process of dividing a program into tasks, each of which can be assigned to an individual processor for execution. Partitioning occurs at compile time, and its goal is to uncover the maximum amount of parallelism possible within certain obvious machine limitations. Suppose a program P is converted into a parallel form Pp: the conversion consists of partitioning P into a set of tasks Ti, some of which can be executed concurrently (in parallel).

Program partitioning is usually performed with some program overhead, and this overhead affects speed-up: if the uniprocessor program P1 performs O1 operations, the parallel version performs Op operations, where Op ≥ O1. The larger the minimum task size defined by the partitioning, the smaller the effect of the overhead. If the available parallelism exceeds the known number of processors, or if several short tasks share the same instruction/data working set, clustering is used to group subtasks into a single assignable task.

The detection of parallelism is done by one of three methods:
1. Explicit statement of concurrency in a high-level language: the programmer defines the boundaries among tasks that can be executed in parallel.
2. Programmer hints in source statements, which the compiler can use or ignore.
3. Implicit parallelism: sophisticated compilers detect parallelism in normal serial code and transform the program for execution on a multiprocessor.

Scheduling

Scheduling governs each program's flow of control among its sub-program tasks, each of which may depend on others. Scheduling is done both statically (at compile time) and dynamically (at run time). Static scheduling alone is not sufficient to ensure optimum speed-up, or even fault tolerance, because processor availability is difficult to predict and may vary from run to run. Run-time scheduling has the advantage of handling changing system environments and program structures, but the disadvantage of run-time overhead. The sketch below illustrates static partitioning with thread-based dispatch.
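The following is a minimal sketch of these ideas, assuming POSIX threads; the arrays, the NTASKS value and the task body are illustrative and not from the slides. A loop over N elements is statically partitioned into NTASKS equal tasks, each dispatched to its own thread (processor):

```c
/* Static partitioning sketch: divide a vector-add loop into NTASKS
 * independent tasks and dispatch one per thread. Illustrative only. */
#include <pthread.h>
#include <stdio.h>

#define N      1024
#define NTASKS 4                     /* assumed number of processors */

static double a[N], b[N], c[N];

typedef struct { int lo, hi; } range_t;

/* One partitioned task: operates on a contiguous slice of the arrays,
 * independent of every other task. */
static void *task(void *arg) {
    range_t *r = arg;
    for (int i = r->lo; i < r->hi; i++)
        c[i] = a[i] + b[i];
    return NULL;
}

int main(void) {
    pthread_t tid[NTASKS];
    range_t   rng[NTASKS];
    int chunk = N / NTASKS;          /* partition fixed at compile time */

    for (int t = 0; t < NTASKS; t++) {
        rng[t] = (range_t){ t * chunk, (t + 1) * chunk };
        pthread_create(&tid[t], NULL, task, &rng[t]);
    }
    for (int t = 0; t < NTASKS; t++) /* join is the only synchronization */
        pthread_join(tid[t], NULL);
    printf("done: c[0] = %f\n", c[0]);
    return 0;
}
```

A dynamic scheduler would instead hand out the ranges from a shared work queue at run time, trading the overheads listed below for adaptability.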
The major run-time overheads in run-time scheduling are:
1. Information gathering: collecting information about the dynamic state of the program and the state of the system.
2. Scheduling itself.
3. Dynamic execution control: clustering or process creation.
4. Dynamic data management: pairing tasks and processors in such a way as to minimize the required memory overhead and delay.

Run-Time Scheduling Techniques
1. System load balancing: the objective is to balance the system's load by dispatching the ready tasks to each processor's execution queue.
2. Load balancing: relies on estimates of the amount of computation needed within each concurrent sub-task.
3. Clustering: pairwise inter-process communication information is developed and used to minimize inter-process communication.
4. Scheduling with compiler assistance: block-level dynamic program information is gathered at run time.
5. Static (custom) scheduling: inter-process communication and computational requirements can be determined at compile time.

Synchronization and Coherency

In a multiprocessor configuration with a high degree of task concurrency, the tasks must follow an explicit order, and communication between active tasks must be performed in an orderly way. Value passing between tasks executing on different processors is performed by synchronization primitives, or semaphores. A semaphore is a variable or abstract data type that provides a simple but useful abstraction for controlling access by multiple processes to a common resource in a parallel programming environment.

Synchronization is the means of ensuring that multiple processors have a coherent view of critical values in memory. Memory coherence is the property of memory that ensures a read operation returns the value stored by the latest write to the same address. In complex systems of multiple processors, the program order of memory operations may differ from the order in which the operations are actually executed. Multiprocessors support different degrees of operation ordering:
1. Sequential Consistency (Fig. a): the result of any execution is the same as if the operations of all processors were executed in some sequential order.
2. Processor Consistency (Fig. b): a load followed by a store is performed in program order, but a store followed by a load is not necessarily performed in order.
3. Weak Consistency (Fig. c): synchronization operations are performed before any subsequent memory operation is performed, and all pending memory operations are performed before any synchronization operation is performed.
4. Release Consistency (Fig. d): synchronization operations are split into acquire (lock) and release (unlock), and these operations are processor consistent. (A code sketch of acquire/release ordering appears after the list below.)

Types of Shared Memory Multiprocessors

The variety in multiprocessors results from the way memory is shared between processors:
1. Shared data cache, shared memory.
2. Separate data caches, but a shared bus to shared memory.
3. Separate data caches with separate buses leading to a shared memory.
4. Separate processors and separate memory modules interconnected with a multi-stage interconnection network.
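As a concrete illustration of release consistency (Fig. d above), here is a minimal sketch using C11 atomics; the spin lock and shared_data names are illustrative, not from the slides:

```c
/* Acquire/release sketch with C11 atomics: lock = acquire, unlock =
 * release, matching the split described for release consistency. */
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int shared_data;           /* ordinary shared variable */

void update(void) {
    /* acquire: no later load/store may be reordered before this */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;                         /* spin until the lock is free */

    shared_data = 42;             /* protected (critical-section) write */

    /* release: all earlier stores are visible before the unlock */
    atomic_flag_clear_explicit(&lock, memory_order_release);
}
```

Only the synchronization operations order memory here; ordinary accesses between them may be freely reordered, which is exactly the freedom weak and release consistency exploit.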
Memory Coherence in Shared Memory Multiprocessors

Each node in a multiprocessor system possesses a local cache. Since the address spaces of the processors overlap, different processors can be holding (caching) the same memory segment at the same time, and each processor may be modifying these cached locations simultaneously. The cache coherency problem is to ensure that all caches contain the same, most recently updated copy of the data. The protocol that maintains the consistency of data in all the local caches is called the cache coherency protocol. There are two families: snoopy protocols and directory protocols.

Snoopy Protocols

A write is broadcast to all processors in the system. Broadcast protocols are usually reserved for shared-bus multiprocessors, in which all processors share memory through a common memory bus. Snoopy protocols assume that all processors observe and receive all bus transactions ("snoop on the bus").

[Figure: processors snooping on a common bus to shared memory.]

Snoopy protocols are further classified by the type of action the local processor must take when an altered line is recognized. There are two types of actions:

Invalidate: all copies in other caches are invalidated before changes are made to the data in a particular line. The invalidate signal is received from the bus, and all caches holding the same cache line invalidate their copies.

Update: writes are broadcast on the bus, and caches sharing the same line snoop for the data on the bus and update the contents and state of their cache lines.

Example: three processors (P1, P2 and Pn) hold consistent copies of block X in their local caches. Using write-invalidate, when processor P1 writes its cached copy from X to X', all other copies are invalidated via the bus. Write-update instead demands that the new block content X' be broadcast to all cached copies via the bus. A state-machine sketch of the invalidate approach follows.
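The slides do not name a specific protocol, so the sketch below assumes a simple MSI-style write-invalidate protocol, showing the transition a snooping cache applies to its own copy of a line when it observes a bus transaction:

```c
/* Write-invalidate snooping sketch (assumed MSI-style states). */
typedef enum { INVALID, SHARED, MODIFIED } line_state_t;
typedef enum { BUS_READ, BUS_WRITE } bus_event_t;

/* Next state of a locally cached line when a bus transaction for the
 * same line, issued by another processor, is snooped. */
line_state_t snoop(line_state_t cur, bus_event_t ev) {
    switch (cur) {
    case MODIFIED:
        /* flush dirty data, then downgrade (read) or invalidate (write) */
        return (ev == BUS_READ) ? SHARED : INVALID;
    case SHARED:
        /* a remote write invalidates every other copy before it proceeds */
        return (ev == BUS_WRITE) ? INVALID : SHARED;
    default:
        return INVALID;           /* nothing cached, nothing to do */
    }
}
```

An update protocol would instead keep the line in SHARED and overwrite its contents with the data snooped from the bus.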
Directory-Based Protocols

These protocols maintain the state information of the cache lines at a single location called the directory. Write-update information is sent only to the caches that are listed in the directory, and are thus known to possess a copy of the newly altered line. Since there is no need to broadcast to all caches, directory-based protocols can scale better than snoopy protocols; however, as the number of processor nodes and the number of cache lines increase, the directory can become very large.

Beyond the invalidate/update distinction, there is an important distinction among directory-based protocols depending on directory placement:

Central Directory: directories specifying line ownership or usage are kept with the memory (centralized). Memory is updated on a write update. The directory is associated with the memory and contains information for each line in memory: each line has a bit vector containing one bit per processor cache in the system.

Directory size (bits) = number of caches × number of memory lines

(As an illustrative example, not from the slides: 64 caches and one million memory lines give a 64 Mbit central directory.)

Distributed Directory: directories specifying line ownership or usage are kept with the processor caches (distributed). Memory holds only one pointer, which identifies the cache that last requested the line; a subsequent request to that line is referred to that cache, and the requestor's ID is placed at the head of the list. Main memory is generally not updated, and the true picture of the memory state is found only in the group of caches.

Directory size (bits) = number of memory lines × ⌈log2(number of caches)⌉

Concurrent Processors

Concurrent processors are processors that can execute multiple instructions at the same time: they can make simultaneous accesses to memory and execute multiple operations simultaneously. Processor performance depends on compiler ability, execution resources and memory-system design. There are two main types of concurrent processors:

Vector Processors: a single vector instruction replaces multiple scalar instructions. Performance depends on the compiler's ability to vectorize the code, transforming loops into sequences of vector operations.

Multiple-Issue Processors: instructions whose effects are independent of each other are executed concurrently.

Vector Processors

A vector computer or vector processor is a machine designed to efficiently handle arithmetic operations on elements of arrays, called vectors. Such machines are especially useful in high-performance scientific computing, where matrix and vector arithmetic are quite common. Supercomputers like the Cray Y-MP are examples of vector processors. A vector processor is an ensemble of hardware resources, including vector registers, register counters, etc. Vector processing occurs when arithmetic or logical operations are applied to vectors, and vector processors achieve considerable speed-up over simple pipelined processors.

Six types of vector instructions:
1. Vector-Vector Instructions: one or two operands are fetched from the respective vector registers and the result is produced in another vector register: Vj × Vk → Vi.
2. Vector-Scalar Instructions: a vector is multiplied by a scalar element to produce a vector of equal length: s × Vi → Vj.
3. Vector-Memory Instructions: M → V (vector load) and V → M (vector store).
4. Vector-Reduction Instructions.
5. Gather and Scatter Instructions: two vector registers are used to gather vector elements from, or scatter them to, locations spread throughout memory: M → V1 × V0 (gather); V1 × V0 → M (scatter).
6. Masking Instructions: a mask vector is used to compress or expand a vector to a shorter or longer index vector, respectively: V0 × Vm → V1 (VM register = masking register; VL register = length of the vector being tested).

Vector Memory

The simple low-order interleaving used in normal pipelined processors is not suitable for vector processors. Vector access is non-sequential but systematic: if the array dimension or stride (the address distance between adjacent elements) is the same as the interleaving factor, all references concentrate on the same module. It is quite common for these strides to be of the form 2^k, or other even dimensions, so vector memory designs use address remapping and a prime number of memory modules. Hashing of addresses is a technique for dispersing them: a strict 1:1 mapping of the bits of X to form a new address X', based on simple manipulations of the bits of X. A memory system used in vector/matrix accessing consists of the following units (a module-mapping sketch follows the list):
1. Address hasher.
2. 2^k + 1 memory modules.
3. Module mapper.
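A minimal sketch of the module mapper, assuming 2^k + 1 = 17 modules (the specific count is an assumption): because the module count is prime, any stride that is not a multiple of 17 visits every module rather than concentrating on one.

```c
/* Vector-memory module mapping onto a prime number of modules. */
#include <stdio.h>

#define MODULES 17u                      /* 2^4 + 1, a prime */

unsigned module_of(unsigned addr)        { return addr % MODULES; }
unsigned offset_in_module(unsigned addr) { return addr / MODULES; }

int main(void) {
    /* Stride-8 accesses (a power-of-two stride): with 16 modules all
     * would hit module 0; with 17 they spread across the modules. */
    for (unsigned i = 0; i < 8; i++)
        printf("element %u -> module %u\n", i, module_of(i * 8));
    return 0;
}
```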
Hashing and module mapping may add overhead and extra cycles to a memory access, but since the purpose of this memory is to access vectors, the overhead can be overlapped in most cases.

Modeling Vector Memory Performance

Vector memory is designed for multiple simultaneous requests to memory, with operand fetching and storing overlapped with vector execution. Three concurrent operand accesses to memory are a common target, although the increased cost of the memory system may limit this to two; chaining may require even more accesses. Another issue is the degree of bypassing, i.e. of out-of-order requests that a source can make to the memory system. In case of a conflict, i.e. a request directed to a busy module, the source can continue to make subsequent requests only if the unserviced requests are held in a buffer. Assume each of the s access ports to memory has a buffer of size TBF/s which holds requests delayed by a conflict. For each source, the degree of bypassing is defined as the allowable number of requests waiting before subsequent requests stall. If Qc is the expected number of denied requests per module and m is the number of modules, then the buffer must be large enough to hold the denied requests:

TBF > m · Qc

If n is the total number of requests made and B is the bandwidth achieved, then

m · Qc = n - B   (denied requests)

Typical buffer entries include:
1. Request source ID.
2. Request source tag (i.e. VR number).
3. Module ID.
4. Address for the request to a module.
5. Scheduled cycle time indicating when the module is free.
6. Entry priority ID.

Gamma (γ) Binomial Model

Assume that each vector source issues a request each cycle (γ = 1) and that each physical requestor has the same buffer capacity and characteristics. If the vector processor can make s requests per cycle and there are t cycles per Tc, then

total requests per Tc = t · s = n

which is the same as n requests per Tc in the simple binomial model. If γ is the mean queue size of bypassed requests awaiting service, then each of the buffered requests also makes a request; from the memory-modeling point of view this is equivalent to the buffer requesting service each cycle until the module is free:

total requests per Tc = t · s + t · s · γ = t · s (1 + γ) = n(1 + γ)

The simple binomial bandwidth equation is then applied with n(1 + γ) requests in place of n.

Calculating γ_opt

γ is the mean expected bypassed-request queue per source. If we continue to increase the number of bypass buffer registers, we can achieve a γ_opt which totally eliminates contention. No contention occurs when B = n, i.e. B(m, n, γ) = n; this occurs when the module occupancy is a = ρ = n/m. Since the MB/D/1 queue size is given by

Q = (a² - pa) / (2(1 - a))

substituting a = n/m and p = 1/m gives

Q = (n² - n) / (2(m² - nm)) = (n/m)(n - 1) / (2m - 2n)

The denied requests per module also satisfy

Q = (n(1 + γ) - B) / m,   so   m · Q = n(1 + γ) - B

For γ_opt we require n - B = 0, so

γ_opt = (m/n) · Q = (n - 1) / (2m - 2n)

and the mean total buffer size is TBF = n · γ_opt. To avoid overflow the buffer may be made considerably larger, perhaps 2 × TBF. The sketch below evaluates these formulas.
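A small sketch evaluating the formulas just derived; the values of n and m are assumptions chosen only to show the arithmetic:

```c
/* Buffer sizing from the gamma binomial model:
 * gamma_opt = (n - 1) / (2m - 2n), TBF = n * gamma_opt. */
#include <stdio.h>

int main(void) {
    double n = 8.0;                  /* requests per Tc (n = t * s) */
    double m = 32.0;                 /* number of memory modules    */

    double gamma_opt = (n - 1.0) / (2.0 * m - 2.0 * n);
    double tbf       = n * gamma_opt;

    printf("gamma_opt = %.3f\n", gamma_opt);      /* 7/48 ~ 0.146 */
    printf("TBF = %.2f, safe size ~ %.2f\n", tbf, 2.0 * tbf);
    return 0;
}
```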
Multiple Issue Machines

These machines evaluate dependencies among groups of instructions; groups found to be independent are simultaneously dispatched to multiple execution units. There are two broad classes of multiple-issue machines:

Statically Scheduled: the detection process is done by the compiler.
Dynamically Scheduled: the detection of independent instructions is done by hardware in the decoder at run time.

Statically Scheduled Machines

Sophisticated compilers search the code at compile time, and instructions found to be independent in their effect are assembled into instruction packets, which are decoded and executed at run time. Statically scheduled processors must carry some additional information, implicit or explicit, indicating instruction-packet boundaries. Early statically scheduled machines include the so-called VLIW (very long instruction word) machines. These machines use an instruction word that consists of 8 to 10 instruction fragments, each controlling a designated execution unit; to accommodate multiple instruction fragments, the instruction word is typically over 200 bits long. The register set is extensively multi-ported to support simultaneous access by multiple execution units. To avoid the performance limitation imposed by branch instructions, a novel compiler technology called trace scheduling is used: branches are predicted where possible, and the predicted path is incorporated into a large basic block. If an unanticipated (mispredicted) branch occurs during execution of the code, the proper result is fixed up at the end of the basic block for use by the target basic block.

Dynamically Scheduled Machines

In dynamically scheduled machines, the detection of independent instructions is done by hardware at run time. The detection may also be done at compile time, with the code suitably arranged to optimize execution patterns. At run time, the search for concurrent instructions is restricted to the locality of the last executing instruction.

Superscalar Machines

The maximum program speed-up available in multiple-issue machines depends largely on sophisticated compiler technology; the potential speed-up available from the Multiflow compiler using trace scheduling is generally less than 3. Recent multiple-issue machines with more modest objectives are called superscalar machines: the ability to issue multiple instructions in a single cycle is referred to as superscalar implementation.

Comparing Vector and Multiple Issue Processors

The goal of any processor design is to provide cost-effective computation across a range of applications, so the two technologies should be compared on two factors: cost and performance.

Cost Comparison

To compare cost, we approximate the area used by each technology for its additional required units. The cost of the execution units is about the same for both (for the same maximum performance); a major difference lies in the storage hierarchy. Both rely heavily on multi-ported registers, and these registers occupy a significant amount of area. If p is the number of ports, the area required by a register set is

Area = (no. of registers + 3p) × (bits per register + 3p) rbe

Most vector processors have 8 sets of 64-element registers, each element being 64 bits in size. Each vector register is dual-ported (a read port and a write port, so p = 2); since the registers are accessed sequentially, each port can be shared by all elements in the register set. There is an additional switching overhead to switch each of the vector registers to each of the p external ports:

Switch area = 2 × (bits per register) × p × (no. of vector registers)

So the area used by the register sets in a vector processor (supporting 8 external ports) is

Area = 8 × [(64 + 6) × (64 + 6)] = 39,200 rbe
Switch area = 2 × 64 × 8 × 8 = 8,192 rbe

A multiple-issue processor with 32 registers, each of 64 bits and supporting 8 ports, requires

Area = (32 + 3(8)) × (64 + 3(8)) = 4,928 rbe

So vector processors use almost 42,464 rbe of extra area compared with multiple-issue processors. This extra area corresponds to about 70,800 cache bits (at 0.6 rbe/bit), i.e. approximately 8 KB of data cache; accordingly, vector processors use only a small data cache. The sketch below reproduces this arithmetic.
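A short sketch that reproduces the register-area arithmetic above; the numbers come straight from the formulas, only the packaging into a program is new:

```c
/* Register area model: Area = (regs + 3p)(bits + 3p) rbe. */
#include <stdio.h>

long reg_area(long regs, long bits, long ports) {
    return (regs + 3 * ports) * (bits + 3 * ports);
}

int main(void) {
    /* Vector processor: 8 sets of 64 x 64-bit elements, dual-ported. */
    long vec = 8 * reg_area(64, 64, 2);   /* 8 * 70 * 70 = 39,200 rbe  */
    long sw  = 2 * 64 * 8 * 8;            /* switch overhead: 8,192 rbe */
    /* Multiple-issue processor: 32 x 64-bit registers, 8 ports. */
    long mi  = reg_area(32, 64, 8);       /* 56 * 88 = 4,928 rbe       */

    printf("vector %ld + switch %ld - MI %ld = extra %ld rbe\n",
           vec, sw, mi, vec + sw - mi);   /* extra = 42,464 rbe */
    return 0;
}
```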
Multiple-issue machines require a larger data cache to ensure high performance. Vector processors require support hardware for managing access to the memory system, and a high degree of interleaving in the memory system to support the processor bandwidth. Multiple-issue machines must support 4-6 reads and 2-3 writes per cycle, which increases the area required by the buses between the arithmetic units and the registers. They must also fetch and hold multiple instructions each cycle from the I-cache, which increases the size of the I-fetch path between the I-cache and the instruction decoder/instruction register; at the instruction decoder, multiple instructions must be decoded simultaneously and instruction independence must be detected.

Performance Comparison

The performance of vector processors depends primarily on two factors: the percentage of code that is vectorizable, and the average length of the vectors. The quantity n_1/2, the vector size at which a vector processor achieves approximately half its asymptotic performance, is roughly the same as the length of the arithmetic plus memory-access pipeline. For short vectors, the data cache of a multiple-issue machine is sufficient, so for short vectors a multiple-issue processor performs better than an equivalent vector processor. As vectors get longer, the performance of the multiple-issue machine becomes much more dependent on the size of its data cache, while the n_1/2 behaviour of the vector processor improves; so for long vectors, the vector processor performs better. The actual difference depends largely on the sophistication of the compiler technology: a compiler can recognize the occurrence of a short vector and treat that portion of the code as if it were scalar code.

References
1. Computer Architecture by Michael J. Flynn.
2. Advanced Computer Architecture by Kai Hwang.