
Developing a Custom Integrated Processor
Analyzing the Price/Performance Tradeoff

FEATURE ARTICLE
Joe Circello & Sylvia Thirtle
Circuit Cellar INK®, Issue 102, January 1999

If standard processor configurations aren’t quite what you need, consider a processor-independent core. Joe and Sylvia bring to your doorstep a customizable core, so you can manipulate valuable variables in the price/performance equation.

New uses for advanced embedded microprocessors are emerging everywhere, especially in the highly competitive, fast-paced market of consumer electronics. Thanks to cooperative efforts with silicon vendors, embedded-system developers can manipulate powerful variables in the price/performance equation that were previously beyond their control.

Optimizing an embedded processor presents an earnest challenge, and it requires the system designer to perform a delicate balancing act between performance and cost. Ultimately, this approach produces an embedded-processor solution that is fine-tuned for a given system and/or application.

DESIGN MODELS
With the traditional system design model, the engineer remains at the mercy of standard product offerings from the semiconductor vendor. A chipmaker’s catalog of standard processor configurations may or may not include precisely what’s required for a given application.

With a standard product, system designers may have to pay for functions (and silicon) they’ll never use. Additionally, the device may lack certain functions that could significantly enhance the system performance of a given application if integrated on-chip.

But a new system design model is emerging, brought on by the availability of modular, fully synthesizable, process-independent microprocessor cores. For the first time, design engineers have unprecedented control over defining and configuring embedded processors.

Customizable cores, like the Motorola ColdFire family, can be cost-effectively tailored to meet the demands of specific applications. The ColdFire architecture was developed to address this class of applications.

Based on variable-length RISC technology, ColdFire combines the architectural simplicity of conventional 32-bit RISC with a memory-saving, variable-length instruction set. In defining the ColdFire architecture, Motorola incorporated a RISC-based processor design and a simplified version of the variable-length instruction set found in the 68k family.

The result is a family of 32-bit microprocessors suited for those embedded applications requiring high performance in a small core size. The ColdFire family provides balanced system solutions to a variety of embedded markets. Here are some of the basic philosophies that have guided all ColdFire designs.

Figure 1: This generalized block diagram shows a custom integrated processor using a ColdFire core. (Block labels: ColdFire core, debug unit, misalignment, K-to-M, system bus controller (SBC), arbitration module, alternate master; RAM, ROM, and cache, each with its controller; slave peripherals and slave I/O; E-bus, K-bus, M-bus, and S-bus.)

When it comes to small, fully synthesizable processor cores, developments are on track with a publicly announced performance roadmap reaching 300 MIPS by the year 2001. Using compiled memory arrays and 100% synthesizable designs enables
system designers to easily define CPU configurations. Figure 1 depicts the standard ColdFire microprocessor configuration. The hierarchical bus structure and the modular architecture are apparent. You can add other logic, in the form of predefined macros, from Motorola’s library or synthesize your own proprietary circuits.

MAXIMUM ARCHITECTURE
Fine-tuning a custom embedded processor for optimal price and performance requires some insight into the specific architecture’s variables. The difficulty of this process is influenced by the sophistication of the silicon vendor’s development environment as well as the system designer’s ability to provide accurate, real-world application data for the target system.

For example, the system OEM may be able to provide information from a previous-generation system. The data can be a key piece of software that represents a critical execution path of the given application. If possible, the ability to extract the key software routines and recompile them for the target system makes the process much easier.

As an alternative, trace data captured from a previous-generation system can also provide critical information for sizing the processor’s local memories (e.g., cache, RAM, ROM). These dynamic traces, whether captured from an earlier design or created by the application code running on a software simulator of the target system, are crucial for the price and performance optimization analysis.

PREDICTING PERFORMANCE
Although ratings for microprocessors are expressed in MIPS, this number often fails to accurately predict the performance of an embedded microprocessor system for a given application. Many times, these ratings need a “mileage you get may vary” disclaimer. Unless the effects of the memory subsystems are taken into account, these simplistic ratings can’t accurately indicate performance.

Today, more precise performance estimates of a hypothetical or actual processor core can be made. By taking specific system and memory subsystem variables into account, this methodology provides a more accurate representation of completely different CPU configurations and architectures.

The predicted performance of a processor can be developed using an average-instruction-time methodology. In its simplest form, this cycles per instruction (CPI) metric represents the number of machine cycles per instruction and is calculated for a single-issue architecture as:

CPI [cycles/inst] = summation {F(i) × ET(i)}

where CPI is the average instruction time expressed in cycles per instruction, F(i) represents the dynamic frequency of occurrence per instruction, and ET(i) is the execution time for a given instruction i. By summing the product of relative frequency and execution time for each instruction type, the average instruction time for a processor executing any given instruction mix can be calculated.

Consider the definition of a base average instruction time (base CPI). Let the base CPI represent maximum processor performance strictly as a function of the instruction mix. Stated differently, this metric represents the processor’s performance assuming the rest of the system (caches, memory modules, etc.) is ideal.

Figure 2a shows the base CPI where the summation product was previously defined and the sequence-related pipeline stalls include all pipeline breaks caused by the instruction sequence. You can calculate the base CPI by summing the product of the relative frequency of occurrence and execution time for each instruction type plus the sequence-related holds.

a) base CPI [cycles/inst] = summation {F(i) × ET(i)} + sequence-related pipeline stalls

b) effective CPI [cycles/inst] = base CPI + summation of memory factors + summation of system factors

c) effective CPI [cycles/inst] = base CPI + IC_miss × IF × IF_stall + OC_miss × REF × OP_stall

where the cache memory degradation factors include:
IC_miss = Cache miss rate on instruction fetches (miss/fetch)
IF = Instruction fetches per instruction
IF_stall = Time [cycles] the processor core is stalled servicing an instruction fetch miss
OC_miss = Cache miss rate on operand fetches (miss/operand fetch)
REF = Operand references per instruction
OP_stall = Time [cycles] the processor core is stalled servicing an operand miss

d) effective CPI = base CPI + {(IC_miss × IF) + (OC_miss × REF)} × {2 + t1 + 0.6 × (t2 + t3 + t4)}

Figure 2a: Here’s the simplified expression for the processor’s performance measured by base CPI. b: This generic expression defines performance as measured by effective CPI. c: This more detailed equation defines effective CPI performance for a processor with cache memory. d: And this is the effective CPI equation for the ColdFire V.2 and V.3 processors.

This base CPI provides a parameter to quantify the performance of a given processor microarchitecture. To convert this value into a more realistic measure of predicted system performance, you have to consider a series of degradation factors.

Let the effective average instruction time (effective CPI) represent this more realistic measure of performance. By quantifying the degradation factors associated with these other system components, the effective CPI can be calculated. As an example, the processor stalls resulting from cache misses typically represent the largest degradation factor in the effective CPI equation.

Figure 3: This diagram gives you an overview of the ColdFire performance-analysis methodology. (Diagram labels: program hex image, ISAsim, HW trace; stream of addresses (PC, operand, instructions); CFxPipe models, memory models, profile; base CPI, hit rates, address histogram; math; eff CPI.)

In Figure 2b, the calculated effective CPI is reached by summing the individual degradation factors. Let’s



define the memory subsystem factors as those associated with a cache memory, and assume the remaining system factors are negligible.

The effective CPI equation can then be rewritten as in Figure 2c. The first degradation term quantifies the CPI contribution due to instruction fetch cache misses, and the second term quantifies the operand reference cache misses.

The relative performance between two systems, x and y, can be expressed as:

(x performance) / (y performance) = (y eff CPI / x eff CPI) × (y cycle time / x cycle time) × (y executed insts / x executed insts)

where the first ratio defines the architectural factor, the second ratio is the technology factor, and the third ratio is the instruction set/compiler factor.

Using the system performance equation, you can analyze the relative performance of different generations of a microprocessor family, or compare different architectures. For benchmarks where the same binary code image is executed on different designs, the relative performance equation reduces to the product of the architectural and technology factors.

MODELING TOOLS
Given the CPI methodology, a number of tools have been developed to assist in this kind of performance analysis for the ColdFire architecture.

You can use a number of architectural models to analyze various factors within the effective CPI performance equation. These tools are typically high-level C language models of certain functions within the design and are driven with information from the ColdFire ISA simulator or trace data.

The ISA model is a C-language program that defines the expected results of execution of the instruction set architecture. By inputting a memory image file, the ISA model executes the program on an instruction-by-instruction basis, updating all program-visible machine registers and memory as required. This ISA model is instrumented to optionally output information on instruction fetch, operand addresses, and program counter values.

By executing the target application on the ISA simulator with the appropriate outputs enabled, a stream of data from the executing application can be input to one of the architectural models. This input data provides the required stimulus to the architectural models.

Processor pipeline models are used for base CPI analysis. There’s also a program that gathers detailed statistics about dynamic opcode usage. Recalling the base CPI equation, this program provides the F(i) factors associated with the various opcodes for the application.

ADDITIONAL ANALYSIS MODELS
The ColdFire cache model quantifies numerous performance parameters for various cache sizes, associativity, and organizations. It uses the stream of reference addresses generated by the simulator as input, and models the behavior of Harvard and unified caches of sizes from 512 bytes to 32 KB.

Additionally, the associativity can vary between two-way and four-way, and the operands can be mapped into copyback or store-through space. This model can also include a RAM, mapped to a specific region, for heavily referenced operands or code segments. Mapping the active region of the stack frame to this type of RAM is often effective.

A second model provides information for memory address profiling. Using the stream of reference addresses as input, this model profiles the memory access patterns to identify critical functions and/or heavily referenced operand locations. For some systems, such profiling helps you understand the required amount of RAM as well as which variables to map into this space to maximize performance.

Of prime importance is verification of the architectural models. So, at various times throughout the analysis process, the accuracy of the architectural models is validated.

The V.2 processor pipeline architectural model was initially verified by comparing predicted base-CPI values versus those directly measured from silicon. Across a large set of embedded benchmarks, the error between the measured base-CPI values and those predicted by the pipeline model was less than 0.5%. The cache architectural models were validated against the design descriptions for several ColdFire MPU designs.

Figure 4: The operand address histogram is taken from a set-top box application. (Axes: reference count, 0 to 100,000, versus memory address, 0x00000000 to 0x000f4000.)

Memory configuration    Relative performance    Relative area
2-KB cache              1.00                    1.00
  +4-KB RAM             1.05                    1.19
4-KB cache              1.19                    1.11
  +4-KB RAM             1.27                    1.31
8-KB cache              1.52                    1.32
  +4-KB RAM             1.61                    1.52
16-KB cache             1.98                    1.79
  +4-KB RAM             2.06                    1.98
32-KB cache             2.71                    2.71
  +4-KB RAM             2.91                    2.91

Table 1: Here’s the relative performance and area for various ColdFire configurations executing a set-top box application.

Another area of interest is the modeling of the {IF, OP}_stall times. These degradation factors represent the pipeline stall that occurs on a cache miss. For the nonblocking streaming cache designs of the V.2 and V.3 cores, these terms are modeled as:

{IF, OP}_stall = (1 + t1) + 1.0 + 0.6 × (t2 + t3 + t4)



where the response time of the external memory for a line-sized fetch is specified as t1-t2-t3-t4 when viewed from the microprocessor pins.

Using the equation in Figure 2d for the V.2 and V.3 designs, the relative error between the predicted and measured effective CPI was less than 2% across a wide suite of embedded benchmarks.

Figure 3 summarizes the process. The architectural models are driven by trace data captured from existing hardware or from a compiled application executed on the instruction set simulator. The resulting streams of addresses and instructions are then input to the specific models.

The profiling tool determines any hot spots in the code or data areas that might be considered for placement in local RAMs or ROMs. The pipeline model produces the base CPI performance metric for a given version of the ColdFire microarchitecture.

The local-memory models determine all the performance parameters associated with the cache, RAM, and ROM modules. The miss ratios are based on size, organization, and the dynamic stream of reference addresses. The base CPI and memory parameters are combined to produce an effective CPI value that provides an accurate measure of predicted performance for a given configuration.

Figure 5a: This graph depicts the relative performance as a function of cache and RAM sizes. b: By contrast, this graph shows the relative performance per area as a function of cache and RAM sizes. (Axes: performance, 0 to 3.00, and performance/area, 0 to 1.20, each plotted against cache size, 2 KB to 32 KB, and RAM size, 0 or 4 KB.)

OPTIMIZATION EXAMPLES
To see how the performance analysis procedure works, consider the following real-world examples.

To begin, let’s say you are implementing a digital set-top box. By instrumenting an existing 68k system, trace data is captured for two critical execution paths.

The challenge is to determine the appropriate amount of local processor memories (cache and possibly RAM) to optimize price and performance for a V.3 ColdFire design. When implemented in 0.35-µm process technology, the V.3 core provides 70 Dhrystone 2.1 MIPS when operating at 90 MHz.

The trace data is profiled to identify any potential hot spots that might benefit from placement in a RAM. The profile in Figure 4 shows several spikes representing heavily referenced operand areas.
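The profiling step can be pictured with a small Python sketch. It is only an illustration of the idea, not Motorola's profiling tool: the function names, the 4-KB bucket size, and the sample addresses are all assumptions made for this example.

```python
from collections import Counter


def address_histogram(addresses, bucket_bits=12):
    """Count references per address bucket.

    bucket_bits=12 groups addresses into 4-KB regions; this granularity is an
    assumption for the sketch, not a documented tool setting.
    """
    counts = Counter()
    for addr in addresses:
        bucket_base = (addr >> bucket_bits) << bucket_bits
        counts[bucket_base] += 1
    return counts


def hot_spots(counts, top=3):
    """Return the most heavily referenced buckets as (base address, count) pairs."""
    return counts.most_common(top)


# Hypothetical operand trace; a real trace would come from hardware capture
# or from the instruction set simulator.
trace = [0x000D2004, 0x000D2FF0, 0x00003010, 0x000D2008, 0x00003014]
for base, count in hot_spots(address_histogram(trace)):
    print(f"{base:08x}: {count} references")
```

Buckets with large counts are the candidates for a local RAM; the bucket size would normally be chosen to match the RAM sizes under consideration.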
The largest reference area is generally the system stack, which makes it the first candidate for mapping into a local RAM. Table 1 gives the relative performance and area, calculated with the architectural models, across a range of cache and RAM configurations. The reference design is a V.3 core with 2 KB of cache memory.
In Table 1, the relative performance ranges from 1.0× to 2.6× as a function of local memory configurations, with a corresponding relative area of 1.0× to 2.9×. Depending on system requirements, the appropriate configuration can be selected, as shown in Figure 5.
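One way to picture the selection step is to rank configurations by performance per unit area, or to pick the smallest configuration that meets a performance floor. The Python sketch below does both for a subset of the Table 1 rows (2 KB through 16 KB). It assumes each "+4-KB RAM" row means the preceding cache size plus a 4-KB RAM. The function names and the 1.5× performance floor are assumptions made for this example.

```python
# (configuration, relative performance, relative area), copied from Table 1.
TABLE_1_SUBSET = [
    ("2-KB cache", 1.00, 1.00),
    ("2-KB cache + 4-KB RAM", 1.05, 1.19),
    ("4-KB cache", 1.19, 1.11),
    ("4-KB cache + 4-KB RAM", 1.27, 1.31),
    ("8-KB cache", 1.52, 1.32),
    ("8-KB cache + 4-KB RAM", 1.61, 1.52),
    ("16-KB cache", 1.98, 1.79),
    ("16-KB cache + 4-KB RAM", 2.06, 1.98),
]


def best_performance_per_area(rows):
    """Return the row with the highest performance/area ratio."""
    return max(rows, key=lambda row: row[1] / row[2])


def smallest_area_meeting(rows, min_performance):
    """Return the lowest-area row that reaches min_performance, or None."""
    candidates = [row for row in rows if row[1] >= min_performance]
    return min(candidates, key=lambda row: row[2]) if candidates else None


print(best_performance_per_area(TABLE_1_SUBSET)[0])
print(smallest_area_meeting(TABLE_1_SUBSET, 1.5))
```

Which criterion matters depends on the system: a cost-driven design leans on the ratio, while a design with a hard throughput requirement starts from the performance floor.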
In the second example, a customer provided a C-language benchmark that represented four execution paths in a servo control application. In this real-time application targeted for a V.2 core, absolute performance in response to certain interrupts was critical.

There was a fixed amount of time to service the interrupt, and the algorithm implemented a number of digital



filters. Given the signal-processing nature of the application, this analysis attempted to quantify the impact of the ColdFire multiply-accumulate unit (MAC). The optional MAC is tightly coupled to the basic execution pipeline and is designed to accelerate signal-processing algorithms.

Initial analysis indicated that the dynamic frequency of occurrence for multiply instructions (i.e., F(mul)) was ~10%. Applying the MAC provides faster execution time for multiply instructions, reducing ET(mul). Many implementations of digital filters can be optimized using MAC instructions directly. First, the benchmark code was compiled and executed on a V.2 core and its performance measured. This value provided the reference. The code was then recompiled using C-language macros to use MAC instructions for arithmetic calculations. Finally, the compiler-generated MAC assembly-language code was optimized by hand to provide an upper bound of performance. Table 2 shows the results.

Configuration                                        Relative performance
CF2, no MAC                                          1.00x
CF2 + MAC with compiler-generated MAC instructions   1.45x
CF2 + MAC with hand-optimized MAC instructions       1.69x

Table 2: Depending on the servo control application, the relative ColdFire performance will vary.

The baseline core configuration included the processor complex with 8 KB of RAM. Including the MAC unit increased this area by only 11% but increased performance by 1.5–1.7×.

WHICH MEANS…
This analysis methodology provides a powerful tool: system designers can now balance processor performance, clock speed, and relative die size.

Given a highly configurable architecture, system designers now have access to the key silicon variables needed to create embedded processor solutions optimized for a given application. And the result? Smart, intuitive, and user-friendly products.

Joe Circello works as an advanced microprocessor architect for Motorola’s Semiconductor Products Sector and was the chief architect for the MC68060 and the ColdFire family of microprocessors. With 23 years of experience, he is a veteran designer specializing in pipeline organization, control structures, and performance analysis. You may reach Joe at Joe_Ciecello-rzsx90@email.sps.mot.com.

Sylvia Thirtle is a principal staff engineer for Motorola’s Semiconductor Products Sector, specializing in high-speed digital ASIC design. In her five years at Motorola, she’s been involved in various design activities with ColdFire and is currently leading the design of the debug module for the next-generation development. You may reach Sylvia at Sylvia_Thirtle-r24495@email.sps.mot.com.

SOURCE
ColdFire
Motorola, Inc.
(512) 895-2134
Fax: (512) 895-8688
www.mot.com/ColdFire
