I. INTRODUCTION
The memory subsystem (memory controller, bus, and DRAM) is becoming a performance bottleneck in many computer systems. The continuing fast pace of growth in the number of cores on a die places increasing demand for data bandwidth on the memory subsystem [1]. In addition, the memory subsystem's share of power consumption is also increasing, partly due to the increased bandwidth demand and partly due to scarce opportunities to enter a low-power mode. Thus, optimizing the memory subsystem is becoming increasingly critical to achieving the performance and power goals of multicore systems.
Optimizing the design of the multicore processor's memory subsystem requires a good understanding of representative workloads. Memory subsystem designers often rely on synthetically generated traces, or on workload-derived traces (collected by trace capture hardware) that are replayed. A trace may include a list of memory accesses, each identified by whether it is a read or a write, the address that is accessed, and the time at which the access occurs. Experiments with these traces lead to decisions about memory controller design (scheduling policy, queue size, etc.), memory bus design (clock frequency, width, etc.), and the DRAM-based main memory design (internal organization, interleaving, page policy, etc.).
This conventional method of relying on traditional traces faces two major challenges. First, many software users are apprehensive about sharing their code (source or binaries) due to the proprietary nature of the code.
Figure 1. Missing feedback loop influences the accuracy of trace simulation: inter-arrival time distribution with fast vs. slow DRAM (a), and DRAM power consumption (in watts) with vs. without the feedback loop, for 1.5 ns, 6 ns, and 12 ns DRAM cycle times (b). Full evaluation parameters are shown in Table II.
Figure 2. DRAM power and bandwidth (GB/s) of the original benchmarks (orig) vs. synthetic traces with geometric (geo) and uniform inter-arrivals.
A uniform inter-arrival (not to be confused with constant inter-arrival) is one where the inter-arrival time values are randomly generated with equal probabilities. A geometric inter-arrival follows the outcome of a series of Bernoulli trials over time. A Bernoulli trial is one with either a successful outcome (e.g., whether a memory reference results in a Last Level Cache/LLC miss) or a failure. A series of Bernoulli trials results in a stream of LLC misses reaching the memory subsystem, with their inter-arrival times following a geometric distribution. A Bernoulli process can be viewed as the discrete-time version of the Poisson arrival in the M/*/1 queueing model, which is widely used in modeling queueing performance. We generate traces with uniform and geometric inter-arrivals and compare them against the original application in terms of DRAM power and bandwidth in Figure 2, assuming 667 MHz DDR3 DRAM. The figure shows that uniform and geometric inter-arrivals do not represent the original benchmarks' behavior well across most benchmarks. While not shown, we also observe a similar mismatch on other metrics such as the average queue length of the memory controller and the average bus queueing delay.
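To make the two baseline models concrete, the following minimal Python sketch (ours; p_miss and max_gap are illustrative parameters) draws inter-arrival times under both models. Running one Bernoulli trial per cycle yields geometrically distributed gaps, which is exactly the discrete-time analogue of Poisson arrivals noted above.

import random

def geometric_interarrivals(n, p_miss):
    # One Bernoulli trial per cycle: a "success" is an LLC miss reaching
    # the memory subsystem, so gap lengths are geometrically distributed.
    gaps = []
    for _ in range(n):
        gap = 1
        while random.random() >= p_miss:   # failure: no miss this cycle
            gap += 1
        gaps.append(gap)
    return gaps

def uniform_interarrivals(n, max_gap):
    # Every gap length in [1, max_gap] is equally probable.
    return [random.randint(1, max_gap) for _ in range(n)]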
In this paper, we propose Memory Traffic Timing (MeT2, or MeToo), a framework for generating clones that replicate the memory traffic timing behavior of the original applications.
Figure 3. Overview of the MeToo framework: an application is profiled while running on a fixed processor and cache hierarchy model; the clone generator then produces a clone that is run on the hardware or simulator under design exploration, whose focus is the memory controller and DRAM, yielding performance results.
Figure 4. Two cases of ld/st accesses to a two-way set (MRU/LRU) initially holding clean blocks A and C; Case 1 leads to one future write-back, while Case 2 leads to two future write-backs.
There are three types of requests that MeToo must reproduce. First, a load or store that misses in the LLC will generate a read request to the memory subsystem, assuming the LLC is a write-back and write-allocate cache. Second, a block write back generates a write request to the memory subsystem. Finally, hardware prefetches generate read requests to the memory subsystem. Note that our MeToo clones must capture the traffic corresponding to all three types of requests, but only one type of request is generated by the execution of instructions on the processor. Thus, a MeToo clone can only consist of instructions that will be executed on the same processor model, yet a challenge is that it must capture and generate all three types of requests. Let us discuss how MeToo handles each type of request.
The first type of request MeToo must capture faithfully is the memory traffic that results from loads/stores generated by the processor that produce memory read requests. To achieve that, we capture the per-set stack distance profile [15] at the last level cache. Let Sij represent the percentage of accesses that occur to the ith cache set and jth stack position, where the 0th stack position represents the most recently used stack position, the 1st stack position represents the second most recently used stack position, etc. Let Si denote the percentage of accesses that occur to the ith set, summed over all stack positions. These two statistics alone provide sufficient information to generate a synthetic memory trace that reproduces the first type of requests (read/write misses), assuming stochastic clone generation is utilized. These are the only statistics that are similar to the ones used in West [9]. All other statistics discussed from here on are unique to MeToo.
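As an illustration, a per-set stack distance profile can be collected with per-set LRU stacks, as in the following Python sketch (our simplification, not the authors' profiler; the 64-byte block size and direct modulo set indexing are assumptions):

from collections import defaultdict

def profile_stack_distances(accesses, num_sets, block_bits=6):
    # accesses: iterable of byte addresses. Returns counts[i][j], the
    # number of accesses to set i that hit LRU-stack position j; a block
    # not yet in the stack is recorded at depth len(stack), i.e. beyond
    # any resident block.
    stacks = [[] for _ in range(num_sets)]   # per-set LRU stacks, MRU first
    counts = defaultdict(lambda: defaultdict(int))
    for addr in accesses:
        block = addr >> block_bits
        i = block % num_sets
        stack = stacks[i]
        j = stack.index(block) if block in stack else len(stack)
        counts[i][j] += 1
        if block in stack:
            stack.remove(block)
        stack.insert(0, block)               # promote to MRU
    return counts

Sij then follows by dividing counts[i][j] by the total number of accesses, and Si by summing counts[i][j] over all j.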
To deal with the second type of memory requests (i.e., ones that occur due to LLC write backs), we found that West's statistics are insufficient. West captures the fraction of accesses that are writes in each set and stack distance position. However, the number of write backs is not determined by the fraction of accesses that are writes; instead, it depends on the number of dirty blocks, which West's write statistics do not capture. To illustrate the difference, Figure 4 shows two cases that are indistinguishable in West but contribute to different numbers of write backs. The figure shows a two-way cache set initially populated with clean blocks A and C. Cases 1 and 2 each show four memory accesses that produce indistinguishable results in West's write statistics: 50% of the accesses are writes, half of the writes occur to the MRU block and the other half to the LRU block. In both cases, the final cache content includes blocks A and C. However, in Case 1, only one
block is dirty, while in Case 2, both cached blocks are dirty. This results in a different number of eventual write backs. The difference between the two cases is simple: a store to a dirty block does not increase the number of future write backs, while a store to a clean block does. In order to distinguish the two cases, MeToo records WCij and WDij, representing the probabilities of writes to clean and dirty blocks, respectively, at the ith cache set and jth stack position.
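The distinction can be reproduced with a toy model of the Figure 4 set. In the sketch below (our own construction; the access sequences are chosen to match the 50%-writes, half-MRU/half-LRU statistics described above), only the clean-vs-dirty state of the stored-to block changes the count of future write backs:

def future_writebacks(resident, ops):
    # resident: the (clean) blocks in a two-way set; ops: (kind, block)
    # pairs with kind "ld" or "st", all hitting resident blocks as in
    # the Figure 4 example. A store to a clean block adds one future
    # write-back; a store to an already-dirty block adds none.
    dirty = set()
    for kind, block in ops:
        assert block in resident
        if kind == "st":
            dirty.add(block)
    return len(dirty)

case1 = [("st", "A"), ("ld", "C"), ("st", "A"), ("ld", "C")]
case2 = [("st", "C"), ("ld", "A"), ("st", "A"), ("ld", "C")]
print(future_writebacks(["A", "C"], case1))   # 1 future write-back
print(future_writebacks(["A", "C"], case2))   # 2 future write-backs

Both sequences look identical to West's write statistics, yet they leave a different number of dirty blocks behind.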
In addition to this, we also record the average dirty block ratio fD (the fraction of all cache blocks that are dirty) and the write ratio fW (the fraction of all requests to the memory subsystem that are write backs). The clone generator uses these two values to guide the generation of loads and stores for the clone. For example, the generator will generate more store instructions when the dirty block ratio is found to be too low during the generation process. The generator will try to converge to all of these statistics during the generation process.
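A minimal sketch of this feedback (the names target_fD and base_store_prob are ours, and the adjustment factors are arbitrary) biases the load/store choice toward whichever direction the running dirty ratio is lagging:

import random

def choose_mem_op(dirty_blocks, total_blocks, target_fD, base_store_prob):
    # Bias the load/store choice toward stores while the running dirty
    # block ratio trails the profiled target fD, and away from stores
    # when it overshoots; the same feedback can be applied to fW.
    running_fD = dirty_blocks / max(total_blocks, 1)
    p_store = base_store_prob
    if running_fD < target_fD:
        p_store = min(1.0, p_store * 1.5)   # push the dirty ratio up
    elif running_fD > target_fD:
        p_store *= 0.5                      # let the dirty ratio drift down
    return "st" if random.random() < p_store else "ld"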
The final type of requests is due to hardware-generated prefetches. The LLC hardware prefetcher can contribute significantly to memory traffic. One option to deal with this is to ensure that the real or simulated hardware includes the same prefetcher as during profiling. However, such an option cannot capture or reproduce the timing of prefetch requests. Thus, we pursue a different approach, which is to gather statistics on the percentage of traffic-generating instructions that trigger hardware prefetches. Then, we generate clones with software prefetch requests inserted in such a way that we match these statistics. In the design exploration runs, hardware prefetchers at the LLC can be turned off.
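For illustration, a generator might emit a software prefetch ahead of a traffic-generating load with the profiled probability, as in this sketch (ours; it emits x86 assembly text, and p_prefetch and prefetch_dist are assumed parameters):

import random

def emit_traffic_inst(addr, p_prefetch, prefetch_dist=64):
    # Emit one traffic-generating load (as assembly text), preceded by a
    # software prefetch with probability p_prefetch so the clone's
    # prefetch rate matches the profiled percentage.
    lines = []
    if random.random() < p_prefetch:
        lines.append("prefetcht0 [0x%x]" % (addr + prefetch_dist))
    lines.append("mov rax, [0x%x]" % addr)
    return lines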
B. Capturing the Timing of Memory Requests
In the previous section, we discussed how we generate memory requests corresponding to the read, write, and prefetch memory requests of the original applications. In this section, we investigate how the inter-arrival timing of the requests can be captured in MeToo clones.
The first question is how the inter-arrival timing information between memory traffic requests should be represented. One choice is to capture the inter-arrival as the number of processor clock cycles that elapse between the current request and the one immediately preceding it. Each inter-arrival can then be collected in a histogram to represent the inter-arrival distribution. One problem with this approach is that the number of clock cycles is rigid and cannot be scaled to take into account the memory performance feedback loop.
Thus, we pursue an alternative choice: measuring the inter-arrival time in terms of the number of instructions that separate consecutive memory requests. We refer to this as the Instruction Count Interval (ICI). An example of an ICI profile is shown in Figure 5. The x-axis shows the number of instructions that separate two consecutive memory requests (i.e., the ICI size), while the y-axis shows the number of occurrences of each ICI size. The x-axis is truncated at 500 instructions. In the particular example in the figure, there are many instances of 20-40 instructions separating two consecutive memory requests, indicating a high level of Memory Level Parallelism (MLP).
Figure 5. ICI profile for bzip2: the number of instances of each ICI size, truncated at 500 instructions.
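Collecting the ICI profile is straightforward given the dynamic instruction count at each memory request; a minimal sketch, assuming the requests are recorded in program order:

from collections import Counter

def ici_profile(request_positions):
    # request_positions: the dynamic instruction count at which each
    # memory request occurs, in program order. Each gap between two
    # consecutive requests is one ICI; return a histogram of ICI sizes.
    return Counter(b - a for a, b in
                   zip(request_positions, request_positions[1:]))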
The ICI profile alone does not account for the behavior we observe where certain ICI sizes tend to recur. ICI size recurrence can be attributed to the behavior of loops. To capture this temporal recurrence, we collect the ICI recurrence probability in addition to the ICI distribution. The ICI sizes are split into bins, where each bin represents similar ICI sizes. For each bin, the probability that the ICI sizes in that bin repeat is recorded.
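A sketch of the recurrence statistic, assuming fixed-width bins of similar ICI sizes (the bin width is our choice):

def ici_recurrence(ici_sizes, bin_width=10):
    # Per-bin probability that the next ICI falls in the same bin as
    # the current one, capturing loop-driven recurrence.
    same, total = {}, {}
    for cur, nxt in zip(ici_sizes, ici_sizes[1:]):
        b = cur // bin_width
        total[b] = total.get(b, 0) + 1
        if nxt // bin_width == b:
            same[b] = same.get(b, 0) + 1
    return {b: same.get(b, 0) / total[b] for b in total}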
Together, the ICI distribution and the recurrence probability profile summarize the inter-arrival timing of memory subsystem requests. However, they still suffer from two shortcomings. First, the number of instructions separating two memory requests may not be linearly proportional to the number of clock cycles separating them: not all instructions are equal or take the same amount of execution time on the processor. Second, the statistics are not yet capable of capturing the feedback loop. We deal with these shortcomings next.
C. Capturing the Timing Feedback Loop
The ICI distribution and recurrence profiles discussed above may not fully capture the inter-arrival timing of memory subsystem requests due to several factors. First, in an out-of-order processor pipeline, the amount of Instruction Level Parallelism (ILP) determines the time it takes to execute instructions: higher ILP allows faster execution of instructions, while lower ILP causes slower execution.
One of the determinants of ILP is data dependences between instructions. For a pair of consecutive traffic-generating memory request instructions, the ILP of the instructions executed between them determines how soon the second traffic-generating instruction occurs. Figure 6(a) illustrates two cases of how dependences can influence the inter-arrival pattern of memory requests. In the figure, each circle represents one instruction. Large circles represent traffic-generating instructions, while small circles represent other instructions, such as non-memory instructions or loads/stores that do not cause memory traffic requests. The arrows indicate true dependences between instructions. Let us suppose that the issue queue has 4 entries. In the first case, instructions block the pipeline until the first memory request is satisfied, and the second traffic-generating instruction needs to wait until the pipeline is unblocked. In the second case, while the first traffic-generating instruction is waiting, only one non-traffic instruction needs to wait; the other instructions do not block the pipeline, and the second traffic-generating instruction can be issued much earlier since it does not need to wait for the first request to be satisfied. The figure emphasizes that although both cases show the same ICI size and the same number of instructions with dependences, the inter-arrival times of memory requests may be very different in the two cases.
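The effect can be reproduced with a toy issue-queue model (ours, not the paper's simulator): in-order dispatch into a 4-entry issue queue and out-of-order issue, with the first load missing for a fixed latency.

def second_issue_cycle(producers, iq_size=4, miss_latency=200):
    # producers[i] is the instruction that i truly depends on (or None).
    # Instruction 0 is a traffic-generating load that misses; all other
    # instructions take 1 cycle once issued. Returns the cycle at which
    # the last instruction (the next traffic-generating one) issues.
    n = len(producers)
    done_at, issued_at = {}, {}
    iq, next_dispatch, cycle = [], 0, 0
    while (n - 1) not in issued_at:
        for inst in list(iq):            # issue whatever is data-ready
            p = producers[inst]
            if p is None or done_at.get(p, float("inf")) <= cycle:
                iq.remove(inst)
                issued_at[inst] = cycle
                done_at[inst] = cycle + (miss_latency if inst == 0 else 1)
        while next_dispatch < n and len(iq) < iq_size:
            iq.append(next_dispatch)     # dispatch stalls when IQ is full
            next_dispatch += 1
        cycle += 1
    return issued_at[n - 1]

chain  = [None, 0, 1, 2, 3, 4, 5]                  # every inst depends on the previous
sparse = [None, 0, None, None, None, None, None]   # only one dependent inst
print(second_issue_cycle(chain), second_issue_cycle(sparse))   # e.g. 206 vs. 2

Under this toy model, the dependence chain delays the second traffic-generating instruction by roughly the full miss latency, while the sparse pattern issues it within a few cycles, despite an identical ICI size.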
Figure 6. An example of how the inter-arrival timing of memory requests
is affected by dependences between traffic-generating memory instructions
and other instructions (a) or between traffic-generating memory instructions
themselves (b).
Table I
STATISTICS THAT ARE COLLECTED FOR METOO.

Notation: What It Represents
Sij: stack distance profile probability at the ith set and jth stack position
Si: stack distance profile probability at the ith set
WCij: probability of a write to a clean block at the ith set and jth stack position
WDij: probability of a write to a dirty block at the ith set and jth stack position
fD: fraction of LLC blocks that are dirty
fW: fraction of memory requests that are write backs
Reck: recurrence probability of an ICI of size k
Rk: dependence ratio (number of dependent instructions divided by number of independent instructions) for an ICI of size k
Nkm: the number of instructions of type m in an ICI of size k
Pr: probability that the next memory request is to the same region of size r as the current memory request
Dk: occurrences of each dependence distance over all ICIs of size k
Mk: probability that a pair of traffic-generating instructions exhibits a true dependence relation
fmem: the fraction of instructions that are memory instructions
Prefetch ratio (per ICI of size k, and averaged over all ICIs of size k): percentage of traffic-generating instructions in an ICI of size k that trigger prefetches
Figure 7. The profiling framework: a benchmark runs on the gem5 core and cache model with the DRAMSim2 DRAM model, while the cache access, cache-hit, locality, and instruction pattern profilers collect the statistics.
Figure 8. Flowchart of the clone generation steps: generate non-memory instructions; determine the address, registers, and load/store type of each traffic-generating instruction; choose the next ICI; and stop when the target number of instructions is reached.
be within the region of size r from the previous traffic-generating instruction. For registers, use Mk to determine if this instruction should depend on the previous traffic-generating instruction. Determine load vs. store based on WCij for the case where j is larger than the cache associativity; fD and fW are also consulted as needed. Finally, generate this instruction, then choose a new ICI size based on the ICI recurrence probability and the ICI size probability distribution, and continue to Step 5.
5) Generate at least one non-memory instruction, based on the non-memory to memory instruction ratio of the original benchmark. Determine each non-memory instruction type by using Nkm, and determine the registers of each instruction using Rk and Dk. Go to Step 6.
6) Determine if the expected number of instructions has been
generated. If so, stop the generation. If no instruction can be
generated any more, also stop the generation. Otherwise, go
to Step 2.
These steps run in a loop until the desired number of traffic instruction pairs is generated. Since we incorporate a strict convergence approach, there is a probability that not all statistics fully converge; in this case, the final number of instructions may be slightly fewer than the target number. The generated instructions are emitted as assembly instructions and then bundled into a source file that can be compiled to produce an executable clone.
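Compressing Steps 2-6 into a sketch (ours; the real generator also converges on the Table I statistics and emits actual assembly rather than placeholder strings) gives the overall loop structure:

import random

def generate_clone(ici_hist, rec_prob, target_insts, bin_width=10):
    # ici_hist: {ici_size: count} from the profile; rec_prob: {bin:
    # recurrence probability}. Instructions are emitted as placeholder
    # strings; the real generator also fixes addresses, registers, and
    # load/store types for each traffic-generating instruction.
    sizes = list(ici_hist)
    weights = [ici_hist[s] for s in sizes]
    insts = []
    ici = random.choices(sizes, weights)[0]
    while len(insts) + ici <= target_insts:
        insts += ["nonmem"] * (ici - 1)          # Step 5: filler instructions
        insts.append("traffic-gen")              # Step 4: memory instruction
        if random.random() >= rec_prob.get(ici // bin_width, 0.0):
            ici = random.choices(sizes, weights)[0]   # redraw the next ICI
    return insts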
VI. VALIDATION AND DESIGN EXPLORATION METHOD
To generate and evaluate the clones, we use the full-system simulator gem5 [16] to model the processor and cache hierarchy, and DRAMSim2 [17] to model the DRAM memory subsystem and the memory controller. The profiling run is performed with a processor and cache hierarchy model with the parameters shown in Table II. The L2 cache is intentionally chosen to be relatively small so that its cache misses will stress test MeToo's cloning accuracy.
We use benchmarks from SPEC CPU2006 [18] for evaluation. We fast-forward each benchmark to skip the initialization stage and clone one particular phase of the application consisting of 100 million instructions (to capture other phases, we can choose other 100-million-instruction windows as identified by SimPoint [19]). Each clone that we generate contains 1 million memory references, yielding a clone that is roughly 20-50x smaller than the original application and allowing much faster simulation. For example, the original benchmarks usually run for several hours, while the clones finish in several minutes. The clone generation itself typically takes a couple of minutes.
Figure 9. Comparison of the inter-arrival timing distributions of memory requests of the original applications vs. their clones.
Figure 10. Cumulative distribution of inter-arrival times of memory requests of benchmarks vs. clones, in standalone (top two rows) and co-scheduled (bottom two rows) modes.
Table II
BASELINE HARDWARE CONFIGURATION USED FOR PROFILING AND CLONE GENERATION (core, cache, bus, and DRAM parameters).

Table III
HARDWARE CONFIGURATIONS USED FOR MEMORY DESIGN EXPLORATION.

Component: Parameter: Variation
Core: Pipeline width: 2, 4, 8, 16
Core: IQ size: 8, 16, 32, 64, 128
Cache: L2 prefetcher: Stride w/ degree 0, 2, 4, 8
Bus: Width: 1, 2, 4, 8, 16 Bytes
Bus: Speed: 100 MHz, 500 MHz, 1 GHz
DRAM: Cycle time: 1.5 ns, 3 ns, 6 ns, 12 ns
DRAM: Address mapping: chan:row:col:bank:rank; chan:rank:row:col:bank; chan:rank:bank:col:row; chan:row:col:rank:bank
A. Validation
In order to validate MeToo clones, we collect inter-arrival time distribution statistics, shown in Figure 9. The first and third rows of the charts show the inter-arrival time distribution for the original benchmarks, while the second and fourth rows show the inter-arrival time distribution for the MeToo clones. The x-axis shows the inter-arrival time (in bus clock cycles), while the y-axis shows the number of occurrences of each inter-arrival time. To improve legibility, we aggregate inter-arrival times larger than or equal to 300 clock cycles into one single bar, and smooth all inter-arrival times smaller than 300 cycles with a 5-cycle-window moving average. The figure shows that for most benchmarks, certain inter-arrival times are dominant, and there is significant variation between benchmarks as well. Some benchmarks, such as libquantum, lbm, and milc, have mostly small inter-arrival times (less than 50 cycles). Others, such as bzip2, mcf, zeusmp, and hmmer, have significant inter-arrival times between 50 and 200 cycles. Some benchmarks, such as gcc, gromacs, hmmer, and astar, have inter-arrival times larger than 200 cycles.
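For reference, the aggregation and smoothing used for these plots can be reproduced as follows (a sketch assuming a simple centered moving average):

def smooth_histogram(counts, cutoff=300, window=5):
    # counts[t] = occurrences of inter-arrival time t (bus cycles).
    # Returns the moving average for t < cutoff plus a single aggregate
    # bar for t >= cutoff, mirroring Figure 9's presentation.
    tail = sum(c for t, c in counts.items() if t >= cutoff)
    half = window // 2
    smoothed = []
    for t in range(cutoff):
        w = [counts.get(u, 0) for u in range(t - half, t + half + 1)]
        smoothed.append(sum(w) / len(w))
    return smoothed, tail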
The figure also shows that MeToo clones are able to capture the dominant inter-arrival times in each benchmark with surprising accuracy. The dominant inter-arrival times are mostly replicated very well; however, some of them tend to get a little diffused. For example, in libquantum, the two dominant inter-arrival times are more diffused in the clone. Small peaks in the inter-arrival times may also get diffused; for example, the small peaks at about 80 cycles for mcf and milc are diffused. The reason for this diffusion effect is that MeToo clones are generated stochastically, hence there is additional randomness introduced in MeToo that was not present in the original benchmarks. However, the diffusion effect is mild. Overall, the figure shows that MeToo clones match the inter-arrival distributions of the original benchmarks well. Recall that MeToo captures inter-arrival statistics in terms of ICI size rather than number of clock cycles; the good match in inter-arrival timing shows that MeToo successfully translates the number of instructions in the intervals into the number of clock cycles in the intervals.
To get a better idea of how well MeToo clones and the original benchmarks fit, we superimpose the cumulative probability distributions of the inter-arrival timing statistics of the clones and benchmarks in Figure 10. The x-axis shows the inter-arrival time in clock cycles, while the y-axis shows the fraction of all instances in each case.
Figure 11. Comparing results obtained by the original benchmarks vs. clones in terms of DRAM power, aggregate DRAM bandwidth, memory controller's scheduling queue length, and bus queueing delay.
Figure 10 shows that the errors of the MeToo clones are only a fraction (at most 35%) of the errors from using geometric or uniform inter-arrivals. This indicates that the geometric/Poisson arrival pattern assumed in most queueing models is highly inaccurate for use in memory subsystem performance studies.
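One way to quantify such fitting errors (a sketch of our own; the paper's exact error definition is not reproduced here) is the maximum gap between the two empirical CDFs, a Kolmogorov-Smirnov-style distance:

def cdf_error(orig_gaps, clone_gaps, max_t=300):
    # Maximum absolute difference between the empirical CDFs of two
    # inter-arrival samples, evaluated at integer cycle counts.
    def cdf(samples, t):
        return sum(1 for s in samples if s <= t) / len(samples)
    return max(abs(cdf(orig_gaps, t) - cdf(clone_gaps, t))
               for t in range(max_t + 1))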
Figure 12. DRAM power, bandwidth, queue size, and queue delay of the clone, geometric, and uniform inter-arrival models, normalized to the original benchmarks.
Figure 13. The fitting errors of MeToo clones vs. geometric inter-arrival clones, across various hardware configurations (pipeline width, IQ size, prefetching degree, bus width, and bus speed).
ACKNOWLEDGEMENT
This research was supported in part by NSF Award CNS-0834664. Balakrishnan was a PhD student at NCSU when he contributed to this work. We thank the anonymous reviewers and Kirk Cameron for valuable feedback that improved this paper.
REFERENCES
[1] B. Rogers et al., "Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling," in Proc. of the 36th International Symposium on Computer Architecture, 2009.
[2] A. Joshi, L. Eeckhout, and L. John, "The return of synthetic benchmarks," in Proc. of the 2008 SPEC Benchmark Workshop, 2008.
[3] R. H. Bell, Jr. and L. K. John, "Efficient power analysis using synthetic testcases," in Proc. of the International Workload Characterization Symposium, 2005.
[4] A. M. Joshi et al., "Automated microprocessor stressmark generation," in Proc. of the 14th International Symposium on High Performance Computer Architecture, 2008.
[5] K. Ganesan, J. Jo, and L. K. John, "Synthesizing memory-level parallelism aware miniature clones for SPEC CPU2006 and ImplantBench workloads," in Proc. of the International Symposium on Performance Analysis of Systems and Software, 2010.
[6] K. Ganesan and L. K. John, "Automatic generation of miniaturized synthetic proxies for target applications to efficiently design multicore processors," IEEE Transactions on Computers, vol. 99, p. 1, 2013.
[7] E. Deniz, A. Sen, J. Holt, and B. Kahne, "Using software architectural patterns for synthetic embedded multicore benchmark development," in Proc. of the International Symposium on Workload Characterization, 2012.
[8] E. Deniz, A. Sen, B. Kahne, and J. Holt, "Minime: Pattern-aware multicore benchmark synthesizer," IEEE Transactions on Computers, vol. 64, no. 8, pp. 2239-2252, Aug. 2015.
[9] G. Balakrishnan and Y. Solihin, "West: Cloning data cache behavior using stochastic traces," in Proc. of the 18th International Symposium on High Performance Computer Architecture, 2012.
[10] A. Awad and Y. Solihin, "STM: Cloning the spatial and temporal memory access behavior," in Proc. of the 20th International Symposium on High Performance Computer Architecture, 2014.
[11] M. Badr and N. Enright Jerger, "SynFull: Synthetic traffic models capturing cache coherent behaviour," in Proc. of the International Symposium on Computer Architecture, 2014.
[12] Z. Kurmas and K. Keeton, "Synthesizing representative I/O workloads using iterative distillation," in Proc. of the International Symposium on Modeling, Analysis and Simulation of Computer Telecommunications Systems, 2003.
[13] C. Delimitrou, S. Sankar, K. Vaid, and C. Kozyrakis, "Storage I/O generation and replay for datacenter applications," in Proc. of the International Symposium on Performance Analysis of Systems and Software, 2011.
[14] G. Balakrishnan and Y. Solihin, "MEMST: Cloning memory behavior using stochastic traces," in Proc. of the International Symposium on Memory Systems, 2015.
[15] R. L. Mattson, J. Gecsei, D. Slutz, and I. Traiger, "Evaluation techniques for storage hierarchies," IBM Systems Journal, vol. 9, no. 2, 1970.
[16] N. Binkert et al., "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1-7, 2011.
[17] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, "DRAMSim2: A cycle accurate memory system simulator," IEEE Computer Architecture Letters, vol. 10, no. 1, pp. 16-19, Jan.-June 2011.
[18] J. L. Henning, "SPEC CPU2006 benchmark descriptions," SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1-17, Sep. 2006.
[19] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder, "Using SimPoint for accurate and efficient simulation," in Proc. of the International Conference on Measurement and Modeling of Computer Systems, 2003.