Milo M. K. Martin (1), Daniel J. Sorin (2), Bradford M. Beckmann (3), Michael R. Marty (3), Min Xu (4),
Alaa R. Alameldeen (3), Kevin E. Moore (3), Mark D. Hill (3,4), and David A. Wood (3,4)
http://www.cs.wisc.edu/gems/
(1) Computer and Information Sciences Dept., Univ. of Pennsylvania
(2) Electrical and Computer Engineering Dept., Duke Univ.
(3) Computer Sciences Dept., Univ. of Wisconsin-Madison
(4) Electrical and Computer Engineering Dept., Univ. of Wisconsin-Madison
Appears in Computer Architecture News (CAN), September 2005
[Figure 1: block diagram. Ruby, the memory system simulator (interconnection network, coherence controllers, caches & memory), is driven by four request generators: a random tester (contended locks, basic verification), microbenchmarks, Simics, and the detailed processor model Opal.]
Figure 1. A view of the GEMS architecture: Ruby, our memory simulator, can be driven by one of four memory system request generators.
to simulate various system components in different levels of detail.

While some researchers used the approach of adding full-system simulation capabilities to existing user-level-only timing simulators [6, 21], we adopted the approach of leveraging an existing full-system functional simulation environment (also used by other simulators [7, 22]). This strategy enabled us to begin workload development and characterization in parallel with development of the timing modules. This approach also allowed us to perform initial evaluation of our research ideas using a more approximate processor model while the more detailed processor model was still in development.

We use the timing-first simulation approach [18], in which we decoupled the functional and timing aspects of the simulator. Since Simics is robust enough to boot an unmodified OS, we used its functional simulation to help us avoid implementing rare but effort-consuming instructions in the timing simulator. Our timing modules interact with Simics to determine when Simics should execute an instruction. However, the result of executing an instruction is ultimately dependent on Simics. Such a decoupling allows the timing models to focus on the most common 99.9% of all dynamic instructions. This task is much easier than requiring a monolithic simulator to model the timing and correctness of every function in all aspects of the full-system simulation. For example, a small mistake in handling an I/O request or in modeling a special case in floating-point arithmetic is unlikely to cause a significant change in timing fidelity. However, such a mistake will likely affect functional fidelity, which may prevent the simulation from continuing to execute. By allowing Simics to always determine the result of execution, the program will always continue to execute correctly.

Our approach is different from trace-driven simulation. Although our approach decouples functional simulation and timing simulation, the functional simulator is still affected by the timing simulator, allowing the system to capture timing-dependent effects. For example, the timing model will determine the winner of two processors that are trying to access the same software lock in memory. Since the timing simulator determines when the functional simulator advances, such timing-dependent effects are captured.
In contrast, trace-driven simulation fails to capture these important effects. In the limit, if our timing simulator was 100% functionally correct, it would always agree with the functional simulator, making the functional simulation redundant. Such an approach allows for “correctness tuning” during simulator development.

Design Goals. As the GEMS simulation system has primarily been used to study cache-coherent shared memory systems (both on-chip and off-chip) and related issues, those aspects of GEMS release 1.0 are the most detailed and the most flexible. For example, we model the transient states of cache coherence protocols in great detail. However, the tools have a more approximate timing model for the interconnection network, a simplified DRAM subsystem, and a simple I/O timing model. Although we do include a detailed model of a modern dynamically-scheduled processor, our goal was to provide a more realistic driver for evaluating the memory system. Therefore, our processor model may lack some details and some flexibility that would be more appropriate for certain detailed microarchitectural experiments.

Availability. The first release of GEMS is available at http://www.cs.wisc.edu/gems/. GEMS is open-source software and is licensed under the GNU GPL [9]. However, GEMS relies on Virtutech’s Simics, a commercial product, for full-system functional simulation. At this time, Virtutech provides evaluation licenses for academic users at no charge. More information about Simics can be found at http://www.virtutech.com/.

The remainder of this paper provides an overview of GEMS (Section 2) and then describes the two main pieces of GEMS: the multiprocessor memory system timing simulator Ruby (Section 3), which includes the SLICC domain-specific language for specifying cache-coherence protocols and systems, and the detailed microarchitectural processor timing model Opal (Section 4). Section 5 discusses some constraints and caveats of GEMS, and we conclude in Section 6.

2 GEMS Overview

The heart of GEMS is the Ruby memory system simulator. As illustrated in Figure 1, GEMS provides multiple drivers that can serve as a source of memory operation requests to Ruby:

1) Random tester module: The simplest driver of Ruby is a random testing module used to stress test the corner cases of the memory system. It uses false sharing and action/check pairs to detect many possible memory system and coherence errors and race conditions [25]. Several features are available in Ruby to help debug the modeled system, including deadlock detection and protocol tracing.

2) Micro-benchmark module: This driver supports various micro-benchmarks through a common interface. The module can be used for basic timing verification, as well as detailed performance analysis of specific conditions (e.g., lock contention or widely-shared data).

3) Simics: This driver uses Simics’ functional simulator to approximate a simple in-order processor with no pipeline stalls. Simics passes all load, store, and instruction fetch requests to Ruby, which performs the first-level cache access to determine if the operation hits or misses in the primary cache. On a hit, Simics continues executing instructions, switching between processors in a multiple-processor setting. On a miss, Ruby stalls Simics’ request from the issuing processor and then simulates the cache miss. Each processor can have only a single miss outstanding, but contention and other timing effects among the processors will determine when the request completes. By controlling the timing of when Simics advances, Ruby determines the timing-dependent functional simulation in Simics (e.g., to determine which processor next acquires a memory block).

4) Opal: This driver models a dynamically-scheduled SPARC v9 processor and uses Simics to verify its functional correctness. Opal (previously known as TFSim [18]) is described in more detail in Section 4.

The first two drivers are part of a stand-alone executable that is independent of Simics or any actual simulated program. In addition, Ruby is specifically designed to support additional drivers (beyond the four mentioned above) using a well-defined interface.

GEMS’ modular design provides significant simulator configuration flexibility. For example,
complicated networks such as a CMP-DNUCA network [5].

For snooping-based systems, Ruby has two totally-ordered networks: a crossbar network and a hierarchical switch network. Both ordered networks use a hierarchy of one or more switches to create a total order of coherence requests at the network’s root. This total order is enough for many broadcast-based snooping protocols, but it requires that the specific cache-coherence protocol does not rely on stronger timing properties provided by the more traditional bus-based interconnect. In addition, mechanisms for synchronous snoop response combining and other aspects of some bus-based protocols are not supported.

The topology of the interconnect is specified by a set of links between switches, and the actual routing tables are re-calculated for each execution, allowing additional topologies to be easily added to the system. The interconnect models virtual networks for different types and classes of messages, and it allows dynamic routing to be enabled or disabled on a per-virtual-network basis (to provide point-to-point order if required). Each link of the interconnect has limited bandwidth, but the interconnect does not model the details of the physical or link-level layer. By default, infinite network buffering is assumed at the switches, but Ruby also supports finite buffering in certain networks. We believe that Ruby’s interconnect model is sufficient for coherence protocol and memory hierarchy research, but a more detailed model of the interconnection network may need to be integrated for research focusing on low-level interconnection network issues.

3.2 Specification Language for Implementing Cache Coherence (SLICC)

One of our main motivations for creating GEMS was to evaluate different coherence protocols and coherence-based prediction. As such, flexibility in specifying cache coherence protocols was essential. Building upon our earlier work on table-driven specification of coherence protocols [23], we created SLICC (Specification Language for Implementing Cache Coherence), a domain-specific language that codifies our table-driven methodology.

SLICC is based upon the idea of specifying individual controller state machines that represent system components such as cache controllers and directory controllers. Each controller is conceptually a per-memory-block state machine, which includes:
• States: the set of possible states for each cache block,
• Events: conditions that trigger state transitions, such as message arrivals,
• Transitions: the cross-product of states and events (based on the state and event, a transition performs an atomic sequence of actions and changes the block to a new state), and
• Actions: the specific operations performed during a transition.

For example, the SLICC code might specify a “Shared” state that allows read-only access for a block in a cache. When an external invalidation message arrives at the cache for a block in Shared, it triggers an “Invalidation” event, which causes a “Shared x Invalidation” transition to occur. This transition specifies that the block should change to the “Invalid” state. Before a transition can begin, all required resources must be available; this check prevents mid-transition blocking. Such resource checking includes available cache frames, in-flight transaction buffers, space in an outgoing message queue, etc. This resource check allows the controller to always complete the entire sequence of actions associated with the transition without blocking.

SLICC is syntactically similar to C or C++, but it is intentionally limited to constrain the specification to hardware-like structures. For example, no local variables or loops are allowed in the language. We also added special language constructs for inserting messages into buffers and reading information from the next message in a buffer.

Each controller specified in SLICC consists of protocol-independent components, such as cache memories and directories, as well as all fields in: caches, per-block directory information at the home node, in-flight transaction buffers, messages, and any coherence predictors. These fields consist of primitive types such as addresses, bit-fields, sets, counters, and user-specified enumerations. Messages contain a message type tag (for statistics gathering) and a size field (for simulating contention on the interconnection network).
A controller uses these messages to communicate with other controllers. Messages travel along the intra-chip and inter-chip interconnection networks. When a message arrives at its destination, it generates a specific type of event determined by the input message control logic of the particular controller (also specified in SLICC).

SLICC allows for the specification of many types of invalidation-based cache coherence protocols and systems. As invalidation-based protocols are ubiquitous in current commercial systems, we constructed SLICC to perform all operations at cache-block granularity (configurable, but canonically 64 bytes). As such, the word-level granularity required for update-based protocols is currently not supported. SLICC is perhaps best suited for specifying directory-based protocols (e.g., the protocols used in the Stanford DASH [13] and the SGI Origin [12]), and other related protocols such as AMD’s Opteron protocol [1, 10]. Although SLICC can be used to specify broadcast snooping protocols, SLICC assumes all protocols use an asynchronous point-to-point network, and not the simpler (but less scalable) synchronous system bus. The GEMS release 1.0 distribution contains a SLICC specification for an aggressive snooping protocol, a flat directory protocol, a protocol based on the AMD Opteron [1, 10], two hierarchical directory protocols suitable for CMP systems, and a Token Coherence protocol [16] for a hierarchical CMP system [17].

The SLICC compiler translates a SLICC specification into C++ code that links with the protocol-independent portions of the Ruby memory system simulator. In this way, Ruby and SLICC are tightly integrated to the extent of being inseparable. In addition to generating code for Ruby, the SLICC language is intended to be used for a variety of purposes. First, the SLICC compiler generates HTML-based tables as documentation for each controller. This concise and continuously-updated documentation is helpful when developing and debugging protocols. Example SLICC code and corresponding HTML tables can be found online [15]. Second, the SLICC code has also served as the basis for translating protocols into a model-checkable format such as TLA+ [2, 11] or Murphi [8]. Although such efforts have thus far been manual translations, we are hopeful the process can be partially or fully automated in the future. Finally, we have restricted SLICC in ways (e.g., no loops) that we believe will allow automatic translation of a SLICC specification directly into a synthesizable hardware description language (such as VHDL or Verilog). Such efforts are future work.

3.3 Ruby’s Release 1.0 Limitations

Most of the limitations in Ruby release 1.0 are specific to the implementation and not the general framework. For example, Ruby release 1.0 supports only physically-indexed caches, although support for indexing the primary caches with virtual addresses could be added. Also, Ruby does not model the memory system traffic due to direct memory access (DMA) operations or memory-mapped I/O loads and stores. Instead of modeling these I/O operations, we simply count the number that occur. For our workloads, these operations are infrequent enough (compared to cache misses) to have a negligible relative impact on our simulations. Those researchers who wish to study more I/O-intensive workloads may find it necessary to model such effects.

4 Detailed Processor Model (Opal)

Although GEMS can use Simics’ functional simulator as a driver that approximates a system with simple in-order processor cores, capturing the timing of today’s dynamically-scheduled superscalar processors requires a more detailed timing model. GEMS includes Opal (also known as TFSim [18]) as a detailed timing model using the timing-first approach. Opal runs ahead of Simics’ functional simulation by fetching, decoding, predicting branches, dynamically scheduling, executing instructions, and speculatively accessing the memory hierarchy. When Opal has determined that the time has come for an instruction to retire, it instructs the functional simulation of the corresponding Simics processor to advance one instruction. Opal then compares its processor state with that of Simics to ensure that it executed the instruction correctly. The vast majority of the time Opal and Simics agree on the instruction execution; however, when an interrupt, I/O operation, or rare kernel-only instruction not implemented by Opal occurs, Opal will detect the discrepancy and