[Block diagram labels: ColdFire Core, Data, RAM, ROM, Cache, Arbitration, Control, Alternate master, S-bus]

[Figure 3 diagram labels: CFxPipe, Memory models, Profile models, Base CPI, Hit rates, Address histogram, Math, Eff CPI]
Figure 3: This diagram gives you an overview of the ColdFire performance-analysis methodology.

model, the engineer remains at the [...] from the semiconductor vendor. A [...] chipmaker’s catalog of standard pro- [...] sors are expressed in MIPS, this number often fails to accurately predict the performance of an embedded microprocessor system for a given application. Many times, these ratings need a “mileage you get may vary” disclaimer. Unless the effects of the memory subsystems are taken into [...]

[...] of occurrence per instruction, and ET(i) is the execution time for a given instruction i. By summing the product of relative frequency and execution time for each instruction type, the average instruction time for a processor executing any given instruction mix can be calculated.
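The weighted sum of relative frequency and execution time can be sketched in a few lines of Python. This is only an illustration, not the article's tooling: the function name `average_instruction_time`, the opcode names, and every frequency and cycle count below are hypothetical.

```python
def average_instruction_time(mix):
    """Return the sum of F(i) * ET(i) over all instruction types.

    mix maps an instruction type to a pair (F, ET): its relative
    frequency of occurrence and its execution time in cycles.
    """
    total_frequency = sum(f for f, _ in mix.values())
    if abs(total_frequency - 1.0) > 1e-9:
        raise ValueError("relative frequencies must sum to 1.0")
    return sum(f * et for f, et in mix.values())


# Hypothetical instruction mix, for illustration only.
example_mix = {
    "move":   (0.45, 1),
    "add":    (0.25, 1),
    "branch": (0.20, 2),
    "mul":    (0.10, 4),
}

print(f"average cycles per instruction: {average_instruction_time(example_mix):.2f}")
```

Because the frequencies are normalized to 1.0, the result is an average over the whole instruction mix, so a rarely used but slow instruction can still dominate the total.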
[Figure 4 histogram; x-axis: Memory address, 00000000 to 000f4000; y-axis ticks: 0 to 100000]
Figure 4: The operand address histogram is taken from a set-top box application.

Table 1: Here’s the relative performance and area for various ColdFire configurations executing a set-top box application.

Memory configuration    Relative performance    Relative area
2-KB cache              1.00                    1.00
  +4-KB RAM             1.05                    1.19
4-KB cache              1.19                    1.11
  +4-KB RAM             1.27                    1.31
8-KB cache              1.52                    1.32
  +4-KB RAM             1.61                    1.52
16-KB cache             1.98                    1.79
  +4-KB RAM             2.06                    1.98
32-KB cache             2.71                    2.71
  +4-KB RAM             2.91                    2.91

[...] define the memory subsystem factors as those associated with a cache memory, and assume the remaining system factors are negligible.

The effective CPI equation can then be rewritten as in Figure 2c. The first degradation term quantifies the CPI contribution due to instruction fetch cache misses, and the second term quantifies the operand reference cache misses.

The relative performance between two systems, x and y, can be expressed as:

$$\frac{\text{x performance}}{\text{y performance}} = \frac{\text{y eff CPI}}{\text{x eff CPI}} \times \frac{\text{y cycle time}}{\text{x cycle time}} \times \frac{\text{y executed insts}}{\text{x executed insts}}$$

where the first ratio defines the architectural factor, the second ratio is the technology factor, and the third ratio is the instruction set/compiler factor.

Using the system performance equation, you can analyze the relative performance of different generations of a microprocessor family, or compare different architectures. For benchmarks where the same binary code image is [...] the product of the architectural and technology factors.

MODELING TOOLS

Given CPI methodology, a number of tools have been developed to assist in this kind of performance analysis for the ColdFire architecture.

You can use a number of architectural models to analyze various factors within the effective CPI performance equation. These tools are typically high-level C language models of certain functions within the design and are driven with information from the ColdFire ISA simulator or trace data.

The ISA model is a C-language program that defines the expected results of execution of the instruction set architecture. By inputting a memory image file, the ISA model executes the program on an instruction-by-instruction basis, updating all program-visible machine registers and memory as required. This ISA model is instrumented to optionally output information on instruction fetch, operand addresses, and program counter values.

By executing the target application on the ISA simulator with the appropriate outputs enabled, a stream of data from the executing application can be input to one of the architectural models. This input data provides the required stimulus to the architectural models.

Processor pipeline models are used for base CPI analysis. There’s also a program that gathers detailed statistics about dynamic opcode usage. Recalling the base CPI equation, this program provides the F(i) factors associated with the various opcodes for the application.

ADDITIONAL ANALYSIS MODELS

The ColdFire cache model quantifies numerous performance parameters for various cache sizes, associativity, and organizations. It uses the stream of reference addresses generated by the simulator as input, and models the behavior of Harvard and unified caches of sizes from 512 bytes to 32 KB. Additionally, the associativity can vary between two-way and four-way, and the operands can be mapped into copyback or store-through space. This [...] enced operands or code segments. Mapping the active region of the stack frame to this type of RAM is often effective.

A second model provides information for memory address profiling. Using the stream of reference addresses as input, this model profiles the memory access patterns to identify critical functions and/or heavily referenced operand locations. For some systems, such profiling helps you understand the required amount of RAM as well as which variables to map into this space to maximize performance.

Of prime importance is verification of the architectural models. So, at various times throughout the analysis process, the accuracy of the architectural models is validated.

The V.2 processor pipeline architectural model was initially verified by comparing predicted base-CPI values versus those directly measured from silicon. Reviewing measured base-CPI values versus those predicted by the pipeline model, the error was less than a 0.5% difference across a large set of embedded benchmarks. The cache architectural models were validated against the design descriptions for several ColdFire MPU designs.

Another area of interest is the modeling of the {IF,OP}_stall times. These degradation factors represent the pipeline stall that occurs on a cache miss. For the nonblocking streaming cache designs of the V.2 and V.3 cores, these terms are modeled as:

$$\{\text{IF}, \text{OP}\}\_\text{stall} = (1 + t_1) + 1.0 + 0.6 \times (t_2 + t_3 + t_4)$$
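The {IF, OP}_stall expression for the V.2 and V.3 cores can be evaluated directly, as in the sketch below. The definitions of t1 through t4 are not recoverable from the text here, so the sketch assumes they are the response times, in processor cycles, of the four transfers of a cache line fill. That reading, the function name `cache_miss_stall`, and the sample timings are all assumptions made for illustration.

```python
def cache_miss_stall(t1, t2, t3, t4):
    """Evaluate (1 + t1) + 1.0 + 0.6 * (t2 + t3 + t4).

    t1..t4 are ASSUMED to be the response times, in cycles, of the
    four transfers of a line fill; the article's own definition is
    not available in this excerpt.
    """
    return (1 + t1) + 1.0 + 0.6 * (t2 + t3 + t4)


# Hypothetical bus timing: 3 cycles for the first transfer,
# then 1 cycle for each of the remaining three.
stall = cache_miss_stall(3, 1, 1, 1)
print(f"modeled stall: {stall:.1f} cycles")
```

In this expression the first transfer is charged in full, while the remaining three are weighted by 0.6.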
[Figure 5 graphs; (a) vertical axis: Performance, 0.00 to 2.50; (b) vertical axis: Performance/Area, 0.00 to 1.00; both plotted against cache size (2 KB to 16 KB) and RAM size (0 and 4 KB)]
Figure 5a: This graph depicts the relative performance as a function of cache and RAM sizes. b: By contrast, this graph shows the relative performance per area as a function of cache and RAM sizes.

[...] from the microprocessor pins.

Using the equation in Figure 2d for the V.2 and V.3 designs, the relative error between the predicted and measured effective CPI was less than 2% across a wide suite of embedded benchmarks.

Figure 3 summarizes the process. The architectural models are driven by trace data captured from existing hardware or from a compiled application executed on the instruction set simulator. The resulting streams of addresses and instructions are then input to the specific models.

The profiling tool determines any hot spots in the code or data areas that might be considered for placement in local RAMs or ROMs. The pipeline model produces the base CPI performance metric for a given version of the ColdFire microarchitecture.

The local-memory models determine all the performance parameters associated with the cache, RAM, and ROM modules. The miss ratios are based on size, organization, and the dynamic stream of reference addresses. The base CPI and memory parameters are combined to produce an effective CPI value that provides an accurate [...]

To see how the performance analysis procedure works, consider the following real-world examples.

To begin, let’s say you are implementing a digital set-top box. By instrumenting an existing 68k system, trace data is captured for two critical execution paths.

The challenge is to determine the appropriate amount of local processor memories (cache and possibly RAM) to optimize price and performance for a V.3 ColdFire design. When implemented in 0.35-µm process technology, the V.3 core provides 70 Dhrystone 2.1 MIPS of performance when operating at 90 MHz.

The trace data is profiled to identify any potential hot spots that might benefit from placement in a RAM. The profile in Figure 4 shows several spikes representing heavily referenced operand areas.
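The kind of operand address profiling described here can be sketched as a simple histogram over a trace of reference addresses. This is a minimal illustration, not the ColdFire profiling tool: the function names, the 4-KB bin size, and the sample trace are all hypothetical.

```python
from collections import Counter


def address_histogram(addresses, bin_size=0x1000):
    """Count references per bin_size-aligned region of memory."""
    return Counter(addr - (addr % bin_size) for addr in addresses)


def hot_regions(histogram, top_n=3):
    """Return the top_n most heavily referenced regions as (base, count)."""
    return histogram.most_common(top_n)


# Hypothetical operand reference trace, for illustration only.
trace = [
    0x000F4010, 0x000F4014, 0x000F4018, 0x00003000,
    0x000F4FFC, 0x00058020, 0x00003004,
]

hist = address_histogram(trace)
for base, count in hot_regions(hist):
    print(f"0x{base:08x}: {count} references")
```

Regions at the top of the list correspond to the spikes in a plot such as Figure 4 and are the natural candidates for mapping into a local RAM.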
The largest reference area is generally
the system stack, which makes it the
first candidate for mapping into a local
RAM. Table 1 gives the relative
performance and area, calculated with
the architectural models, across a range
of cache and RAM configurations. The
reference design is a V.3 core with
2 KB of cache memory.
In Table 1, the relative performance
ranges from 1.0× to 2.6× as a function
of local memory configurations with a
corresponding relative area of 1.0–2.9×.
Depending on system requirements,
the appropriate configuration can be
selected, as shown in Figure 5.
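One way to make that selection concrete is sketched below. It uses the Table 1 values for the 2-KB to 16-KB cache sizes and reads each "+4-KB RAM" row as the preceding cache size plus a 4-KB RAM, which is an interpretation of the table layout. The function names and the area budget of 1.5 are hypothetical, chosen only to illustrate the trade-off.

```python
# (configuration, relative performance, relative area) from Table 1.
TABLE_1 = [
    ("2-KB cache",             1.00, 1.00),
    ("2-KB cache + 4-KB RAM",  1.05, 1.19),
    ("4-KB cache",             1.19, 1.11),
    ("4-KB cache + 4-KB RAM",  1.27, 1.31),
    ("8-KB cache",             1.52, 1.32),
    ("8-KB cache + 4-KB RAM",  1.61, 1.52),
    ("16-KB cache",            1.98, 1.79),
    ("16-KB cache + 4-KB RAM", 2.06, 1.98),
]


def performance_per_area(rows):
    """Return (configuration, performance / area) for every row."""
    return [(name, perf / area) for name, perf, area in rows]


def best_within_area(rows, max_area):
    """Return the highest-performance row whose area fits the budget."""
    candidates = [row for row in rows if row[2] <= max_area]
    if not candidates:
        raise ValueError("no configuration fits the area budget")
    return max(candidates, key=lambda row: row[1])


for name, ratio in performance_per_area(TABLE_1):
    print(f"{name:24s} {ratio:.2f}")

# Hypothetical requirement: relative area no larger than 1.5.
print(best_within_area(TABLE_1, max_area=1.5))
```

A cost-sensitive design would rank configurations by the performance-per-area ratio, while a design with a hard performance target would filter on the performance column first.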
In the second example, a customer
provided a C-language benchmark that
represented four execution paths in a
servo control application. In this real-
time application targeted for a V.2
core, absolute performance in response
to certain interrupts was critical.
There was a fixed amount of time
to service the interrupt and the algo-
rithm implemented a number of digital