|
Hierarchical Model
In any computer system, cache memory is divided logically and physically
into so-called levels. For example, in an abstract machine with 32Kb of
internal cache (built into the processor core) and 1Mb of external cache
(located on either the processor module or the mainboard), the first may be
called the 1st-level cache and the second the 2nd-level cache. Modern computer
systems may accommodate up to four cache levels, though a two-level
organisation remains the most popular. Cache memory of the 1st level is
usually subdivided into an instruction cache (I-cache) and a data cache
(D-cache), the so-called Harvard architecture.
Although the I-cache and D-cache are usually the same size, this isn't mandatory.
At the same time, they are almost always integrated into the processor core
for performance reasons. Counterexamples are few, and most of
them belong to the PA-RISC architecture. The Hewlett-Packard PA-7000
uses external I-cache and D-cache of 256Kb each, the PA-7100 and PA-7150
of 1Mb and 2Mb respectively, the PA-8000 of 1Mb each, and the PA-8200
of 2Mb each. However, none of these processors featured high
clock speeds, so it was no trouble to run their I-cache and D-cache at
full core clock speed with tolerable access latencies. Nevertheless, these
processors delivered very good performance in their day. In the SPEC95
benchmarks, an HP Visualize C200 workstation with a
200MHz PA-8000 matched a DEC Personal WorkStation
600au with a 600MHz Alpha 21164A in floating-point performance
(21.4 versus 21.3), though it lagged behind by about 20% in integer
performance (14.3 versus 18.4). This doesn't mean that one
architecture is better or worse than another; it is a reminder that core clock
speed and cache organisation are just two of the many factors that determine
the performance of any hardware implementation.
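To make the comparison above concrete, the short Python sketch below
recomputes the relative figures from the SPEC95 scores and clock speeds quoted
in the text. The score-per-MHz ratio is only an illustrative metric assumed
for this sketch (SPEC defines nothing of the kind), and all names in the code
are hypothetical.

```python
# SPEC95 scores and clock speeds exactly as quoted in the text above.
systems = {
    "PA-8000": {"mhz": 200, "int": 14.3, "fp": 21.4},
    "Alpha 21164A": {"mhz": 600, "int": 18.4, "fp": 21.3},
}


def relative_deficit(slower, faster):
    """Fraction by which the slower score trails the faster one."""
    return (faster - slower) / faster


def score_per_mhz(system, kind):
    """Illustrative metric (an assumption of this sketch): score per MHz."""
    return system[kind] / system["mhz"]


pa = systems["PA-8000"]
alpha = systems["Alpha 21164A"]

int_deficit = relative_deficit(pa["int"], alpha["int"])
print(f"Integer deficit of the PA-8000: {int_deficit:.1%}")

for kind in ("int", "fp"):
    ratio = score_per_mhz(pa, kind) / score_per_mhz(alpha, kind)
    print(f"{kind}: the PA-8000 delivers {ratio:.2f}x the score per MHz")
```

With these inputs the integer deficit comes out at roughly 22%, in line with
the "about 20%" mentioned above, while per clock cycle the PA-8000 scores
between two and three times higher. That is exactly why clock speed alone
says little about performance.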
There are several reasons why split caches prevail over unified ones
(U-cache) on the 1st level. Different functional units fetch information from
the I-cache and D-cache: the decoder and scheduler (I-box) work with the
I-cache, while the integer execution unit (E-box), also referred to as the
arithmetic logic unit, and the floating-point unit (F-box) communicate with
the D-cache. The load/store unit (A-box) and the cache and system bus
controller (C-box) are also involved directly in cache operations. Every
functional unit, incidentally, usually consists of one or several execution
pipelines. I-cache and D-cache
operate with very low access latencies because any increase would cause a
serious performance loss on most tasks. To keep latencies low, cache
size is sacrificed and usually ranges from 8Kb to 64Kb. It is not
an easy task to place a large cache on a silicon die and to ensure its proper
synchronisation. If a cache grows in size while its
internal organisation stays intact, it inevitably takes more time to search
the cache for a piece of information and to transfer it
to the output. Apart from that, the more functional unit pipelines use a
particular cache, the more access ports are required to serve them, and adding
ports is a difficult job (cache ports are discussed in depth
later). Additionally, a U-cache has other drawbacks. For example, all
data must be evicted whenever instructions are flushed from the cache. This
usually happens on system exceptions that flush the processor
pipelines and restart them at a new address. It is also regular practice to
flush a virtually tagged I-cache on every task switch. On the other
hand, a U-cache uses its capacity more effectively, i.e. the proportion
between the instructions and data it holds varies with the task being
executed.
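The last point, that a unified cache can shift its capacity between
instructions and data, can be illustrated with a toy simulation. The sketch
below is not a model of any real processor: it assumes fully associative
caches with LRU replacement, a hypothetical workload of a tight loop
(2 instruction lines) walking over 6 data lines, and equal total capacity
(4 + 4 lines split versus 8 lines unified). All class and function names are
made up for the illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with LRU replacement (an assumption;
    real 1st-level caches are usually set-associative)."""

    def __init__(self, lines):
        self.lines = lines
        self.store = OrderedDict()

    def access(self, tag):
        """Return True on a hit, False on a miss (the line is then filled)."""
        if tag in self.store:
            self.store.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.store) >= self.lines:
            self.store.popitem(last=False)   # evict the least recently used line
        self.store[tag] = None
        return False


def count_hits(trace, icache, dcache):
    """Route instruction fetches to icache and data accesses to dcache."""
    hits = 0
    for kind, line in trace:
        cache = icache if kind == "I" else dcache
        hits += cache.access((kind, line))
    return hits


# Hypothetical workload: 2 instruction lines, 6 data lines, 600 iterations.
trace = []
for i in range(600):
    trace.append(("I", i % 2))
    trace.append(("D", i % 6))

split_hits = count_hits(trace, LRUCache(4), LRUCache(4))   # 4 + 4 lines
unified = LRUCache(8)                                      # same total capacity
unified_hits = count_hits(trace, unified, unified)

print(f"split:   {split_hits}/{len(trace)} hits")
print(f"unified: {unified_hits}/{len(trace)} hits")
```

In this deliberately favourable case the split D-cache thrashes (6 lines
cycling through 4 slots never hit under LRU) while half of the I-cache sits
idle, so the split pair scores 598 hits out of 1200 against 1192 for the
unified cache. The sketch says nothing about latency or access ports, which
are the reasons split caches win on the 1st level anyway.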
Cache memory of the 2nd level is almost always unified, though there are
several well-known exceptions. For instance, the HARP-1 (Hitachi Advanced
RISC Processor) of the PA-RISC v1.1 architecture contained an 8Kb I-cache and
a 16Kb D-cache, which were backed by external instruction and data caches of
512Kb each. Further, the SPARC64 V design by HAL Computer (not the one by
Fujitsu that went into production under the same name) featured a 32Kb I-cache
(augmented with a 1024-entry trace cache) and an 8Kb D-cache, which were
backed by integrated instruction and data caches of 256Kb and 512Kb
respectively; an external unified cache of up to 64Mb was also employed. Among
the most recent examples is the Intel Itanium 2 (the one code-named Montecito).
Each of its two cores has a dedicated 16Kb I-cache and 16Kb D-cache,
integrated 2nd-level instruction and data caches of 1Mb and
256Kb respectively, and an integrated unified 3rd-level cache of
12Mb. The primary reason for going unified is that cache memory of the 2nd
level doesn't need to be as fast as that of the 1st level. In practice, the
I-cache and D-cache usually satisfy about 80-90% of all memory requests, which
leaves room for a trade-off: caches of the lower levels may have higher access
and delivery latencies in exchange for larger sizes and more associativity
ways to improve hit rates. If cache memory of the 2nd level gets integrated into the
processor core, it may be called S-cache (secondary cache). If a
processor employs an integrated cache of the 3rd level, it may be called T-cache
(tertiary cache). An external cache, which most likely consists of regular
discrete static memory chips, may be referred to as B-cache (back-up
cache). As a matter of fact, the B-cache is usually the last level of the cache
hierarchy. Unlike integrated caches, a B-cache may be driven by the
processor's C-box, by the system logic, or by both. In general, every
successive cache level is larger but slower than the previous one.
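The trade-off described above is commonly quantified as the average memory
access time: each level adds its own latency, weighted by the fraction of
requests that actually reach it. The sketch below applies that standard
formula. Only the 80-90% hit rate of the 1st level comes from the text; the
latencies, the 2nd-level hit rate and the function name are assumptions chosen
for illustration.

```python
def average_access_time(levels, memory_latency):
    """Average memory access time in cycles.

    levels: list of (access_latency, local_hit_rate) pairs, ordered from the
            1st cache level downwards.
    memory_latency: cycles needed once every cache level has missed.
    """
    penalty = memory_latency
    # Work upwards from memory: a miss at a level costs the penalty below it.
    for latency, hit_rate in reversed(levels):
        penalty = latency + (1.0 - hit_rate) * penalty
    return penalty


# Assumed figures: 3-cycle L1 with 90% hits, 12-cycle L2 with 80% hits,
# 200 cycles to main memory.
l1_only = average_access_time([(3, 0.90)], 200)
l1_and_l2 = average_access_time([(3, 0.90), (12, 0.80)], 200)

print(f"L1 only:   {l1_only:.2f} cycles")
print(f"L1 and L2: {l1_and_l2:.2f} cycles")
```

Under these assumed numbers the slower but larger 2nd level cuts the average
from about 23 cycles to about 8, even though it only ever sees the 10% of
requests that miss in the 1st level.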
The hierarchy of cache levels is best observed at the processor's
run time. If a datum is not available in a register when an
instruction executes, a request is generated and dispatched to the nearest
cache level, i.e. to the D-cache. If the request misses, it is redirected
to the next cache level, and so on. In the worst case, the datum is
delivered directly from main memory. Its arrival may be
delayed even further if the datum was previously pushed out by the virtual
memory subsystem to a swap file (or swap partition), i.e.
to a hard disk drive. It takes tens to hundreds of processor cycles to receive
a register-wide quantum of information from main memory, but with
hard disk drives the count may reach hundreds of thousands or millions of
cycles.
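The walk through the hierarchy described above can be sketched as a loop that
probes one level after another and adds up the latency. This is a simplified
model under stated assumptions: the latencies are invented round numbers,
levels are probed strictly in sequence, no eviction takes place, and the swap
area is treated as just another level, whereas in a real system a page fault
is handled by the operating system. The names Level and load are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Level:
    name: str
    latency: int                                  # cycles to probe (hypothetical)
    contents: dict = field(default_factory=dict)  # address -> value


def load(address, hierarchy):
    """Probe each level in turn; return (value, level name, total cycles)."""
    cycles = 0
    for depth, level in enumerate(hierarchy):
        cycles += level.latency
        if address in level.contents:
            value = level.contents[address]
            # Fill the faster levels on the way back (no eviction is modelled).
            for upper in hierarchy[:depth]:
                upper.contents[address] = value
            return value, level.name, cycles
    raise KeyError(f"address {address:#x} is not mapped anywhere")


hierarchy = [
    Level("D-cache", 2),
    Level("S-cache", 10),
    Level("B-cache", 30),
    Level("main memory", 150, {0x40: 7}),
    Level("swap on disk", 5_000_000, {0x80: 9}),
]

print(load(0x40, hierarchy))   # first access is served by main memory
print(load(0x40, hierarchy))   # repeat access now hits the D-cache
print(load(0x80, hierarchy))   # datum has to come back from the swap area
```

With these made-up latencies the first access costs 192 cycles, the repeated
one only 2, and the swapped-out datum more than five million, which mirrors
the orders of magnitude given in the text.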
|