ATC 115paper Hold

Working towards exploiting the different #T cells within the context of an ARM core.
ATC-115
Betina Hold, ARM 10/25/2011
ABSTRACT: Core memories are required to be fast and reliable while operating at low voltages. Traditionally, 6T memories are used for caches, TLBs, TAGs, etc. However, there are several alternatives to the traditional 6T SRAM cell which offer some interesting benefits, but at a cost. Building a core architecture that exploits these features to provide clear benefits requires careful planning and considerable look-ahead. Unfortunately, it is not plug-and-play. Exploiting and optimizing new building blocks requires customization, and choosing a nontraditional route involves risk and requires confidence that the costs will be acceptable to the end user. This paper discusses alternative SRAM options in the context of an ARM core architecture.
Copyright 2011 ARM Limited. All rights reserved.

The ARM logo is a registered trademark of ARM Ltd. All other trademarks are the property of their respective owners and are acknowledged.
Introduction: Mobile and embedded electronics are characterized by small microprocessors and microcontrollers. ARM processors, as low-cost solutions, are major players in these markets due to their relative simplicity and low power consumption. Customization of size parameters and features adds to the flexibility of implementation and lends itself to numerous applications of varying nature. Part of this flexibility comes from the building blocks of the cores themselves. However, flexibility has its limits: an infinite number of features cannot be supported, and a common subset is most advantageous to core providers and to designers/users. Core designers and implementers often prefer having a choice of independent IP providers, and core providers require a finite specification with a reasonable number of options.

The building blocks: The building blocks, or IP, used to synthesize larger cores or even SoCs fall into two main groups: standard cells and memories. There are other building blocks, such as specialty hard macros or IOs, which merit their own investigations and are outside the scope of this paper. For the purposes of this paper, the term memory refers to an SRAM-type memory. Both standard cells and memories are in turn made of different subcomponents and can exhibit a wide range of functionality. Standard cells vary in size, logic function and, in some cases, storage behavior, while memories vary in size, memory function, access options and, in some cases, memory features.

Moving further down the design scale, these building blocks are made up of even smaller entities: transistors, which themselves have unique and differentiating features. Multiple Vt offerings multiply the number of standard cell libraries available and create a menu of SRAM bitcells. Figure 1 highlights the design scope.
[Figure 1 diagram: building blocks (Standard Cells, Memories) and optimization targets (Low Power / Low Energy, VMIN, Performance, Area, Reliability / Stability)]
Figure 1: Core building blocks and possible user-optimized output options.
Foundry options of varying Vt result in devices with different drive strengths and leakage currents. Applying this library of Vt variants, along with the freedom of device sizing, results in implementation options which can produce standard cells and memories with varying speed, power, leakage and area characteristics. Consequently, cores and subsequent SoCs can exhibit varying characteristics depending on the alternatives chosen for the above-mentioned building blocks. Nor is the design scope limited to a single choice: a mix of building blocks with different characteristics can also be used to optimize individual subsections of either the IP itself or the end product of a core. Critical paths may be optimized for speed, and non-critical paths may be optimized for power/leakage. Some cache memories may be optimized for performance and others for area. TLBs or TAGs may be optimized for features, perhaps trading off area.

With this assortment of devices and bitcells (low level) and IP (higher level) comes a tradeoff among voltage, area, leakage, performance and functionality. Figure 2 outlines options for building blocks. Each foundry offers a multitude of bitcells with varying read and standby currents; see Figures 2(a) and 2(b). Foundries also furnish bitcells rated for different nominal voltages, Figure 2(c). Functionality is more applicable to the bitcells and the IP derived from them, and is illustrated in Figure 2(d). For a memory building block there are standard 6T cells with shared and non-shared R/W ports, standard 6T cells with varying sizes and Vt implants, and 8T cells with shared and non-shared R/W ports. There are also larger bitcells with 10 or more devices: some optimize for power and/or leakage control, some offer extra functionality through additional R/W ports, and some provide CAM features.
[Figure 2: Building block options. (a) Istandby versus Iread; (b) Standby current versus read current; (c) Technology Voltage Device Offerings; (d) Foundry Functional Offerings. Axes in (a) and (b): Istandby against Iread.]
Designing a system comprising memories of any sort involves understanding the access mechanisms. For SRAMs, reading and writing are the basics. Exactly how these steps are carried out, and what sequencing and parallelism are possible, depends heavily on the type of bitcell the memory is built from.

Optimizing such a system, on the other hand, involves exploiting all possible characteristics and features of these mechanisms from both hardware and software perspectives. In most cases, the parameters of the specification will drive the IP selection. However, if the IP set is limited, then the specification could in fact be limited as well. If, on the other hand, specifications were pushed towards more speed or parallelism, then the IP would have to evolve to cater for a new set of criteria. We have a classic chicken-or-egg problem: should the IP drive the end product, or should the end product/spec drive the IP? In some cases the latter is already done for performance requirements. ARM offers a Processor Optimized Package which contains a specific subset of memory instance sizes that fit well with ARM core requirements. Because it is optimized for performance, the larger generator space is sacrificed.

The industry standard for core L1 and L2 caches centers on the 6T structure. Optimizing for speed, area or leakage can still remain in the realm of the 6T and thus does not require system access adjustments. However, there are some explorations in custom processors and microcontrollers revolving around larger 8T bitcells. In the last two years, publications at ISSCC 2010 and 2011 introduced processing units constructed with 8T L1 caches. The reasons ranged from low-voltage operation, bitcell stability, two-port operation and RMW (read-modify-write) functionality to a fast read and, in some cases, a fast write. Additionally, from an implementation point of view, as IPC (instructions per cycle) increases, multiport access is becoming more interesting. Consequently, dual port RAMs would certainly offer some benefits. However, for these to be required in an ARM core, the specifications must be revised to demand the extra functionality of the IP, which in turn demands alternative building blocks.
A true dual port 8T (8TDP) is costly in area, timing control and read speed, while a two port 8T (TP8T) offers some middle ground between the 8TDP and the 6T. A summary of 6T and TP8T characteristics is shown in Table 1. There are various custom multiport bitcell varieties available, but they do not meet the criterion of a common set of access options/features for IP sharing, and they would require too much customization of core logic for each of the numerous memory products they could generate.

Property                     | 6T                   | 8T
read capability              | 1X                   | 1-1.7X
write capability             | 1X                   | 1-1.4X
read stability               | depends on 6T        | Excellent
read disturb                 | 1X                   | No
bitcell leakage              | 1X                   | MUX vs bigger 6T bitcell portion 1.5-1.8X
RMW capability               | no                   | yes
separate read/write control  | no                   | yes
separate read/write dpath    | no                   | yes
SER tolerance                |                      |
low voltage operation        | Varies per 6T        | Gains of 100-300 mV
area                         | 1X                   | 1.5-1.7X
usability in cores           | known                | needs to be learned and evaluated
read mux capability          | yes                  | yes
write mux capability         | yes                  | not with all current offerings, as write disturb could destroy mux'd columns
byte write                   | yes                  | depends on above
bit write                    | yes                  | depends on above
SA overhead                  | differential with 8x | single ended: 1x-?x overhead
architecture                 | regular WL load      | extra WL load
Table 1: 6T and 8T Bitcell tradeoffs

Referring again to Table 1, there are some clear advantages and some disadvantages. Having a very stable and separate read path in the TP8T permits optimization of the write path. However, not optimizing the cross-coupled inverters for write presents the opportunity for a second read port. Additionally, the stability of the bitcell during a write is also up for debate and will determine whether the bitcell can be multiplexed. More data access design tradeoffs surface for the memory IP itself, and consequently more options for a core design present themselves. Once again, having new features is attractive, but it comes at an implementation cost. Hence, on top of acceptance, standardization of the TP8T bitcell itself would be required to support a common IP offering which an ARM core could support.

Core implementation considerations: Cores like CPUs and microcontrollers use memories to provide fast local access to instructions and data. In L1 caches, the expectation is that read and write accesses will complete within a little more than a core clock cycle. Most of these computing units have historically been limited to mainly single-ported 6T RAMs for a number of reasons, as outlined above. Consequently, specifications have been written with this type of building block in mind. There is a setup period for every access which is dominated by core logic, leaving little room for the memory setup time, Tsu, itself.
The clock-to-output time, Tq, partially fills the access phase, with the restriction that Tq must be small enough to leave room for the setup time of the output capture flops. All of this results in strict timing constraints on the maximum width and depth of the caches. Ideally, the SRAM access times should be close to the critical logic timing path.
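
The timing relationship described above can be sketched as a simple budget check. This is only an illustration under stated assumptions: the class and function names, the picosecond values, and the simplification that the setup phase and the access phase must each fit within one clock period are hypothetical and are not taken from any ARM specification.

```python
from dataclasses import dataclass


@dataclass
class AccessTiming:
    """Hypothetical timing budget for one L1 memory access (all values in ps)."""
    clock_period_ps: int
    core_logic_delay_ps: int      # core logic ahead of the memory inputs
    mem_tsu_ps: int               # memory setup time, Tsu
    mem_tq_ps: int                # memory clock-to-output time, Tq
    capture_flop_setup_ps: int    # setup time of the output capture flops


def timing_slack(t: AccessTiming) -> tuple[int, int]:
    """Return (setup_slack, access_slack); a negative value is a violation."""
    # Setup phase: core logic dominates, the memory Tsu gets what is left.
    setup_slack = t.clock_period_ps - (t.core_logic_delay_ps + t.mem_tsu_ps)
    # Access phase: Tq must leave room for the capture flop setup time.
    access_slack = t.clock_period_ps - (t.mem_tq_ps + t.capture_flop_setup_ps)
    return setup_slack, access_slack


def meets_timing(t: AccessTiming) -> bool:
    """True when neither phase overruns the clock period."""
    return all(slack >= 0 for slack in timing_slack(t))


# Assumed example numbers: a 1 GHz clock with logic-dominated setup.
example = AccessTiming(1000, 800, 150, 900, 50)
print(timing_slack(example))   # (50, 50)
print(meets_timing(example))   # True
```

In this model a wider or deeper cache shows up as a larger Tq (and possibly Tsu), which is why the constraint on cache dimensions follows directly from the two slack terms.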

This emphasis on minimizing time comes from serial processing with a traditional 6T memory cell. With only one way in and out, the most effective way to increase throughput is to speed up each of these paths. If, on the other hand, an additional port is added, accesses may overlap and the Tsu and Tq requirements may not be as stringent. With instruction throughput demand growing, multiport solutions are attractive since they bring some parallelism into the serial computing world. However, the core or CPU must be aware of the multiple access capabilities of the memory. For the TP8T, the separate read and write ports (1R, 1W) are the features to exploit. The option for a read-modify-write, a lower Vmin and improved frequency are further traits to highlight.

However, cores in general are not designed with if-then alternatives. It is not feasible from an implementation point of view to design a computing unit which caters to either a dual port or a single port memory, as each requires unique instruction and data paths as well as branch prediction algorithms. Dual-ported RAMs (1R1W), for example, would open up new microarchitectural options such as reducing the need to bank memories. Banked memories are used to simulate dual port RAMs by splitting a logical RAM into multiple physical RAMs, one per bank. Replacing them with dual port RAMs would provide a potential performance benefit by reducing or eliminating bank conflicts. In other cases, dual port RAMs would allow the implementation of more complex branch predictors, for example. However, the inclusion of dual port RAMs is something microarchitects would need to know up front, as it must be properly incorporated into the initial microarchitecture modeling for a new core. Incorporating an L1 cache constructed from TP8T bitcells is a fundamental design decision and requires a whole new way of thinking, which will necessitate new configurations within an architecture specification.
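
The bank conflict argument above can be illustrated with a toy cycle count. This is a minimal sketch under assumptions that are not from the paper: banks are selected by the low-order address bits (address modulo the bank count), each single-port bank serves one access per cycle, a read and a write to the same bank are serialized over two cycles, and the function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# One entry per issue slot: (read address or None, write address or None).
Access = Tuple[Optional[int], Optional[int]]


def cycles_banked_single_port(accesses: Sequence[Access], num_banks: int) -> int:
    """Cycles needed when a logical RAM is split into single-port banks."""
    cycles = 0
    for read_addr, write_addr in accesses:
        both_present = read_addr is not None and write_addr is not None
        if both_present and read_addr % num_banks == write_addr % num_banks:
            # Bank conflict: the read and the write must take turns.
            cycles += 2
        else:
            cycles += 1
    return cycles


def cycles_two_port(accesses: Sequence[Access]) -> int:
    """Cycles needed with a 1R1W memory: one read and one write per cycle.

    Simultaneous read and write of the same address is assumed to be
    handled by the memory or the core and is not modeled here.
    """
    return len(accesses)


# Assumed access pattern with four banks; two slots collide on a bank.
pattern = [(0, 1), (2, 6), (4, None), (3, 7)]
print(cycles_banked_single_port(pattern, num_banks=4))  # 6
print(cycles_two_port(pattern))                         # 4
```

The gap between the two counts is the cost of bank conflicts for this particular pattern; a real evaluation would use access traces from the target core rather than a hand-written list.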
This new way of thinking about the L1 cache must occur early, both at the hardware design phase (synthesis or hard macros) and at the software instruction generation stage. The resulting configuration is again guided by performance, power and area requirements, along with functionality considerations. Design implementation aside, ARM soft cores are created to work with generic RAMs. Currently, generic RAMs consist of flavors of 6T memories. Until the TP8T benefits are seen and demand rises, TP8T bitcells are limited to hard macros and special custom processors. Therein lies the challenge for a TP8T breakout.

Next Steps: Choosing a TP8T RAM is one of many steps required in the design of an ARM core that incorporates 8T memories. Before that decision is made, the 8T cell requires some improvements of its own. The metrics and yield enhancement steps which have gone into the design of 6T cells in the past must be applied to the 8T bitcells. Demand increases quality. Customers and users of cores must learn the advantages and implications of the new functionality and reshape their hardware and software architectures to exploit the TP8T features. Core designers must quantify the benefits, and the foundries must make these cells available with acceptable yield and cost. All in all, it is the basic demand <-> supply chain model, with the key point being to highlight the benefits of the TP8T and to encourage demand to follow, driven by potential lead partners.
References: 1) ISSCC 2011 2) ISSCC 2010 3) ARM Colleagues: Matt Elwood, David Williamson

