You are designing a write buffer between a write-through L1 cache and a write-back L2 cache. The L2 cache write data bus is 16 B wide and can perform a write to an independent cache address every 4 processor cycles.
a. How many bytes wide should each write buffer entry be?
b. What speedup could be expected in the steady state by using a merging write buffer instead of a non-merging buffer when zeroing memory by the execution of 64-bit stores if all other instructions could be issued in parallel with the stores and the blocks are present in the L2 cache?
c. What would be the effect of possible L1 misses on the number of required write buffer entries for systems with blocking and non-blocking caches?

Answer:

a)

Each write buffer entry should be 16 bytes wide, matching the 16 B width of the L2 cache write data bus, so that a full entry can be transferred to the L2 in a single write.

The following simulated machine configuration is from the non-blocking cache study cited in parts b) and c):

Clock: 2.5 GHz
L1 I-cache: 32 KB, 8-way, 64 B line size, 4-cycle access latency
L1 D-cache: write-back, write-allocate; MSHR with 0 (lockup cache), 1, 2, and 64 (unconstrained non-blocking cache) entries; write-back buffer with 16 entries
L2 cache: 256 KB, 8-way, 64 B line size, 10-cycle access latency
L3 cache: 2 MB per core, 64 B line size, 36-cycle access latency
Memory: DDR3-1600, 90-cycle access latency
Issue width: 4
Instruction window size: 36
ROB size: 128
Load buffer size: 48
Store buffer size: 32

b)

With 16 B write buffer entries, a merging write buffer combines two sequential 64-bit (8 B) stores into a single entry, so each L2 write slot (one every 4 cycles) retires 16 B of store data. A non-merging buffer holds one store per entry and retires only 8 B per slot. In the steady state the merging buffer therefore zeroes memory at twice the rate of the non-merging buffer: a speedup of 2.

Instruction- and thread-level parallelism provide further latency-hiding opportunities, and their use is likely to be the primary tool to combat whatever memory delays are encountered in modern multilevel cache systems.
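The steady-state arithmetic can be checked with a short Python sketch (a minimal model; the entry width, store size, and L2 write interval come from the problem statement):

ENTRY_BYTES = 16       # write buffer entry width, matching the L2 write bus
STORE_BYTES = 8        # each 64-bit store writes 8 bytes
L2_WRITE_INTERVAL = 4  # the L2 accepts one buffered write every 4 cycles

# Non-merging buffer: each store occupies its own entry, so only 8 B
# of data reach the L2 per write slot.
non_merging_rate = STORE_BYTES / L2_WRITE_INTERVAL

# Merging buffer: two sequential 8 B stores share one 16 B entry, so a
# full entry reaches the L2 per write slot.
merging_rate = ENTRY_BYTES / L2_WRITE_INTERVAL

print(merging_rate / non_merging_rate)  # 2.0, the steady-state speedup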

c)

With a blocking cache, the processor stalls on an L1 miss, so stores stop arriving and the write buffer drains during the stall; few entries are needed. With a non-blocking cache, the processor continues issuing stores while misses are outstanding, so the buffer must hold the writes that accumulate during the miss latency, and more entries are required.

The study cited above quantifies the benefit of non-blocking caches. Compared with the lockup-cache setup (hit-under-0-miss), the average performance (measured as CPI) improvement for the integer programs is 7.08% for hit-under-1-miss, 8.36% for hit-under-2-misses, and 9.02% for hit-under-64-misses (essentially the unconstrained non-blocking cache). For the floating-point programs, the three numbers are 12.69%, 16.22%, and 17.76%, respectively.
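A toy discrete-cycle model can illustrate the effect in part c). All parameters here (store rate, miss timing, miss latency) are invented for illustration; the point is only that a blocking cache lets the buffer drain during a stall, while a non-blocking cache keeps filling it:

def peak_buffer_occupancy(blocking, stores=64, drain_interval=4,
                          miss_cycle=16, miss_latency=20):
    # One L1 load miss occurs at miss_cycle. A blocking cache stalls
    # all issue for miss_latency cycles; a non-blocking cache does not.
    occupancy = peak = issued = cycle = 0
    while issued < stores or occupancy > 0:
        if occupancy and cycle % drain_interval == 0:
            occupancy -= 1   # the L2 drains one entry per write slot
        stalled = blocking and miss_cycle <= cycle < miss_cycle + miss_latency
        if issued < stores and not stalled:
            issued += 1      # one 64-bit store enters the buffer per cycle
            occupancy += 1
        peak = max(peak, occupancy)
        cycle += 1
    return peak

print("blocking peak:    ", peak_buffer_occupancy(True))
print("non-blocking peak:", peak_buffer_occupancy(False))

The non-blocking run reports a higher peak occupancy, i.e., it needs more write buffer entries to avoid stalling the store stream.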

Non-blocking caches are an effective technique for tolerating cache-miss latency. They can reduce miss-induced processor stalls by buffering the misses and continuing to serve other independent access requests. Previous research on the complexity and performance of non-blocking caches supporting non-blocking loads showed that they could achieve significant performance gains over blocking caches. However, those experiments used benchmarks that are now over a decade old. Furthermore, the simulated processor was a single-issue processor with unlimited run-ahead capability, a perfect branch predictor, a fixed 16-cycle memory latency, single-cycle latency for floating-point operations, and write-through, write-no-allocate caches. These assumptions are very different from today's high-performance out-of-order processors such as the Intel Nehalem. Thus, it is time to re-evaluate the performance impact of non-blocking caches on practical out-of-order processors using up-to-date benchmarks. The cited study evaluates the impact of non-blocking data caches using the SPEC CPU2006 benchmark suite on practical high-performance out-of-order (OOO) processors. Simulations show that a data cache supporting hit-under-2-misses can provide a 17.76% performance gain for a typical high-performance OOO processor running the SPEC CPU2006 benchmarks, in comparison to a similar machine with a blocking cache.
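As a rough illustration of the hit-under-k-misses policies the study compares (k = 0, 1, 2, 64), here is a minimal Python sketch of an MSHR file. The class and its interface are invented for illustration; a real MSHR additionally records which words of the block each pending access needs:

class NonBlockingCache:
    # mshr_entries = 0 models the lockup (blocking) cache.
    def __init__(self, mshr_entries):
        self.mshr_entries = mshr_entries
        self.outstanding = set()   # block addresses being fetched from L2

    def access(self, block_addr, hit):
        if hit:
            return "hit"       # serviced even with misses outstanding
        if block_addr in self.outstanding:
            return "merged"    # secondary miss joins the existing MSHR
        if len(self.outstanding) >= self.mshr_entries:
            return "stall"     # no free MSHR: the processor must wait
        self.outstanding.add(block_addr)   # allocate an MSHR
        return "miss"

    def fill(self, block_addr):
        self.outstanding.discard(block_addr)   # data arrived; free the MSHR

cache = NonBlockingCache(mshr_entries=2)    # hit-under-2-misses
print(cache.access(0x040, hit=False))  # "miss"  (first MSHR allocated)
print(cache.access(0x080, hit=True))   # "hit"   (hit under 1 miss)
print(cache.access(0x0c0, hit=False))  # "miss"  (second MSHR allocated)
print(cache.access(0x100, hit=False))  # "stall" (both MSHRs in use)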
