Guillaume Girard Report

Automated testing using a reference instruction set simulator extracted from documentation
Guillaume GIRARD September 24, 2001
Abstract
Implementing a complete and correct instruction set simulator for today's complex computer architectures is a difficult task. Automated testing is a way to improve the test coverage of such a simulator. Usually these test-cases are checked against a reference machine. This paper presents the development of an automated test framework: a reference simulator is implemented, partially extracted from the documentation, and an execution comparison system is set up to validate the functioning of the real simulator. A test generator is written and used to check the instruction set, with useful results to improve the simulator correctness.
Contents
1 Introduction
  1.1 Virtutech and Simics
  1.2 Automated testing
  1.3 Aim of this thesis
  1.4 Organization of the report
2 Related work
  2.1 Hardware description languages
  2.2 Usual processor descriptions
3 IA-64 Description
  3.1 Instruction encoding
    3.1.1 Instruction flow
    3.1.2 Encoding
  3.2 Register files
  3.3 Integer computations
  3.4 Predication
  3.5 Speculation
    3.5.1 Control speculation
    3.5.2 Data speculation
  3.6 Register stack
  3.7 Register rotation
  3.8 Floating-point architecture
  3.9 Multimedia support
  3.10 Memory organization
    3.10.1 Virtual addressing and memory protection
    3.10.2 TLB
    3.10.3 VHPT
  3.11 Interruption handling
  3.12 Debugging and performance monitoring
    3.12.1 Debugging
    3.12.2 Performance monitoring
4 Extraction of information from the documentation
  4.1 PDF and its shortcomings
  4.2 Intel description format
  4.3 Pseudo-code
  4.4 Results
5 The reference simulator
  5.1 Register files
  5.2 Instruction decoding
  5.3 Exception handling
  5.4 Memory simulation
  5.5 Register Stack Engine
  5.6 Translation Look-aside Buffers
  5.7 Floating-point computation
  5.8 Pseudo-code functions
  5.9 Implementation dependent features
  5.10 State saving and loading
6 Simics module
  6.1 State files
  6.2 A test generator
  6.3 Results
7 Future work
8 Conclusion
A State file format
Chapter 1
Introduction
The instruction set of a modern processor can reach several hundred instructions that perform complex operations on the registers and the memory system. Implementing this set of instructions is a difficult task: correctness is achieved when every single bit takes its right value after the execution of the operations, whatever the input state provided. On today's computers, input state means megabytes of memory in a multi-level hierarchy and huge register banks with strong interactions. However, though the input state is theoretically enormous, the changes are usually observable after each instruction has been run, and the range of their effects is limited. It is thus conceivable to validate an implementation by extensive verification of the results.
The implementation of an instruction set simulator encounters the same problems: going from a simulator able to boot a modern operating system to a bug-free simulator whose correctness can be validated is a time-consuming task. Most of the errors happen when running unusual or strange series of instructions, or with special arguments or values. It is far from easy to understand why a simple command like ls crashed on a simulator when an operating system like Solaris has just booted and correctly executed one billion instructions. Extensive testing is a way to improve the correctness of such a simulator and to avoid the time spent debugging elusive bugs later in the development process.
1.1 Virtutech and Simics
Virtutech is a company producing simulation tools for the development and design of high-performance computer systems. Its main product, Simics, is a simulator acting as one or more virtual workstations or servers. Simics can today model a variety of processor types and computer systems, both workstations and multiprocessor servers, including SPARC-, Alpha-, and x86-based systems. Simics fully virtualizes the target, allowing multiple processors or multi-node systems to be modeled, with arbitrary memory and device configuration, regardless of the host system. Simics is sufficiently fast to run realistically scaled commercial workloads, and sufficiently accurate to boot and run unmodified operating systems, including Solaris, Linux, Tru64, and Windows NT. To accomplish this, Simics includes binary compatible models for several devices, including SCSI, graphics cards, console, interrupt, timers, EEPROM and Ethernet. Network simulation can connect to Simics Central, allowing a virtual server to appear on the local network with full services available (NFS, NIS, rsh, etc.). Simics Central also allows the simulation of complete networks by connecting multiple Simics processes together, all within a single virtual clock domain. Such large models can be run parallelized, either on a multiprocessor host or on a network of workstations.
Several types of processors are already simulated in Simics, and the aim is to cover all the important high-end architectures available on the computer market today. When a new architecture is introduced, the generic tools used at Virtutech make the development of a functional simulator a rather quick task, usually a matter of a few months. The Simics framework, along with some external programs, provides, for example, the automatic generation and optimization of a decoder and a disassembler, and efficient memory transaction handling. However, Simics is confronted with the problem described previously, that is, going from a functional simulator to a completely correct implementation of the instruction set.
1.2 Automated testing
Automatic testing of the simulator is a way to find and quickly correct corner-case bugs and to validate its implementation, as a generated set of tests will be more thorough and accurate in its testing than a human being can be. To be implemented efficiently, this solution implies the use of a reference machine to be able to detect inconsistencies and errors in the simulator state after the execution of one or several instructions. It also implies a program with enough knowledge about the architecture to generate relevant test-cases. The reference machine can be provided in different ways:

by using a real system. This method obviously provides the only complete and absolute reference model. It has a major drawback: it is difficult to use a real system as a reference since we need the possibility to take non-intrusive snapshots of the real machine, which is likely to be very difficult at system level. This is in fact the very reason why simulation is an interesting tool in system development.
by using a complete simulator, at the gate level. Such models are only available to the system's designers, and are well-kept secrets. Because of their precision, they are also very slow and not well suited for high-level tests on the instruction set.
by using another instruction set simulator. This method has the advantage of functioning even if the real system is still under development, or if the system does not even exist. It does however imply that the reference simulator is as complete and accurate as possible, and it also makes the testing phase more complex, since discrepancies can be introduced by bugs in either of the two simulators.
The test-case generator can at first be reduced to a very simple program generating random bit patterns as instructions. This would be a good stress test of the decoding accuracy of the simulator. To produce more complex and realistic test cases, the program needs a description of the architecture (at least the register files and the instruction set) to generate legal series of instructions. More complex information could be provided to generate tests for specific systems like the Translation Look-aside Buffers (TLB).
The results of these tests need to be analysed and transmitted to the programmers for debugging when an error is discovered. This task can be performed by the reference machine with the help of information about implementation dependent features, so that irrelevant differences can be filtered out.
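The comparison step described above can be sketched as follows. This is only an illustration under stated assumptions: the flat dictionary view of a processor state, the function name diff_states and the register names used in the test are hypothetical and are not taken from Simics or from the reference simulator.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical flat view of a processor state: register name -> value.
State = Dict[str, int]


def diff_states(
    reference: State, simulated: State, ignored: Set[str]
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {register: (reference_value, simulated_value)} for each mismatch.

    Registers listed in `ignored` stand for implementation dependent
    features: a difference there is filtered out and not reported.
    """
    mismatches = {}
    for name in sorted(set(reference) | set(simulated)):
        if name in ignored:
            continue  # implementation dependent, not an error
        ref_value = reference.get(name)  # None if the register is missing
        sim_value = simulated.get(name)
        if ref_value != sim_value:
            mismatches[name] = (ref_value, sim_value)
    return mismatches
```

Running such a comparison after every instruction, rather than at the end of a long test, keeps the first reported mismatch close to the instruction that caused it.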
1.3 Aim of this thesis
This master's thesis is a practical application of the testing method described above for the development of the IA-64 architecture in Simics. It was performed in parallel with the implementation of the Simics simulator. As a reference machine, the choice was made to develop a reference simulator since the first IA-64 implementation, the Itanium processor, was not publicly available yet. Intel provides a pseudo-C description of the entire instruction set that can be extracted to implement the semantics of the instructions, and thus improve the correctness of the simulator. The project was divided into several steps:
Automatically extracting as much information as possible from the Intel documentation so that the generation of the reference simulator would be as independent of human coding as possible. This includes extracting the format of the instructions as well as the pseudo-code describing their implementation.
Building a framework around the extracted information to get a working reference simulator.
Coding a Simics module to perform parallel tests on Simics and on the reference simulator and analyse the differences found to signal errors.
Developing a test-case generator by using the information extracted from the documentation.
As this represents quite a lot of work, the thesis itself was centered around the first three points: getting a documentation-generated simulator for the IA-64 and being able to compare it to the Simics implementation.
1.4 Organization of the report
Chapter 2 describes related work on the automatic generation of simulators and the testing of existing implementations. Chapter 3 provides a short description of the IA-64 architecture. Chapters 4 to 7 present how the different steps were carried out and with what degree of success. They also emphasize the problems encountered as well as possible improvements.
Chapter 2
Related work
How to describe a processor at the instruction set level is a question that has been discussed for many years (I found papers dating from the late sixties, like [DUE68]) and it has generated a wide range of solutions. Languages used to provide a description can be sorted in different categories, according to their formalism:
English, a common choice as a non-formal description language.
Non-executed, readable pseudo-code languages with vague semantics. They usually contain typos and bugs that make them non-executable.
Pseudo-code generated from executed code, transformed to be readable. They can be close to a real programming language (Algol, C, etc.).
Executed code in a real programming language known outside the company who designed the chip.
Hardware description languages with completely defined semantics. They usually provide much more information on the processor structures. They also need a complete description of all the internal mechanisms where the other types of languages would refer to English descriptions.
We'll make a quick survey of some of the existing languages and their application to simulation and testing.
2.1 Hardware description languages
The research world has produced quite a number of hardware description languages focused on different levels of abstraction (gates, components, register transfers, instruction set, etc.).
Some languages present the processor at the instruction set level, including the semantics of the instructions. ISPS was designed around 1980 [SIE82, part 1, chapter 4]. The description is done in the following way: the memory declarations describe the structure of the processor (registers and memory), which will be considered as the state; the formats and operations describe the instruction format (bit fields) and the service functions (address calculation, etc.); the interpreter decodes the instructions using the instruction format and performs the changes on the state of the processor as the instructions are run. The operations are described in a Pascal-like language. The instruction decoding is performed explicitly with the help of a DECODE statement applied to specific bit fields (it is similar to a switch/case in C). Although the language was primarily designed for documentation, a description can easily be compiled to produce a simulator of the processor described.
LISAS is an instruction set description language developed in the beginning of the nineties [COO93, COO94]. It is a functional language describing a state and operations that transform the state. In a similar way to ISPS, the state is described first (registers and memory), then the format (bit fields) of an instruction is defined. The instructions are provided by specifying the values of the encoding bit fields and setting any parameters, then the operations that will change the processor state. Decoding is done implicitly.
Some languages also include micro-architecture information, like latency, execution units or pipeline simulation. nML was presented in [FAU95]. It describes the structure of the processor and the behavior of its instructions. Each of them is described by its binary encoding, its assembly syntax and the actions it performs. nML introduces a simple timing model based on latency specification. nML was extended into Sim-nML [RAJ98] by introducing the concept of resource usage. This allows a more accurate simulation of pipelines and latencies. Using Sim-nML, several tools have been developed to generate simulators, assemblers, disassemblers, profiling tools, a compiler back-end, etc. [1].
LISA is a language designed from the start to produce bit and cycle/phase accurate processor models [ZIV96]. It adds to the other languages a strong pipeline model based on resource constraints. Instructions are defined by their binary encoding, their scheduling in the pipeline model and the operations that are performed on the processor structure. It has been successfully used to describe cycle-accurate models for DSP [PEE99].
Some languages are focused on the encoding/decoding task. SLED (Specification Language for Encoding and Decoding) was presented in 1997 as part of the New Jersey Machine-Code Toolkit [RAM97a]. It is designed to describe the encoding of the instructions, and thus contains no state information. Once the bit fields are defined, patterns describe the binary representation of the instructions and constructors connect symbolic, assembly-language and binary representations of instructions. SLED does not define semantics for the instructions, and can be used in an application program either to encode or decode instructions. SLED descriptions were used to automatically generate complete tests of instruction sets [RAM97b]. The goal was to compare SLED descriptions to an independent reference assembler to check their correctness. The tests were designed from the descriptions themselves, which provided the way instructions were built (encoding bits and arguments).
Simgen, like SLED, is oriented toward the encoding and the decoding of binary instructions [LAR97]. Bit fields are defined and instructions are described with their bit pattern (to identify the binary encoded instruction) and their assembly syntax. Simgen was written to optimize the decoding phase and produce a fast C decoder. It provides a semantic field for each instruction, to be filled with C code which is not checked but directly output when generating the C decoder. The decoder then becomes a simulator.
Another type of language used to describe hardware components is the full description language, which can model everything from the gates to the instruction set. VHDL [SHA86] and even more so Verilog [2] are the standards used for this kind of modeling. However, using them to describe processors tends to be so complex that it is impossible to work at the instruction set level and describe the operations involved in a simple manner. MIMOLA [BAS94] is an attempt to provide a tool in between, where a processor is still simple to describe, but where one can control everything down to the gate level if needed.
Plenty of other languages have not been cited here; they fit in the different categories or have specifics of their own. Among others: RADL, EXPRESSION, ISDL, HMDES, Maril, VLW, etc.
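The approach shared by several of these languages (declare bit fields, identify each instruction by a bit pattern, attach a semantic action) can be sketched in a few lines. The 16-bit toy instruction format, the Pattern class and the add/sub instructions below are invented for illustration only; they do not reproduce the syntax of ISPS, SLED or Simgen.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, int]  # toy processor state: register name -> 16-bit value


def field(word: int, low: int, width: int) -> int:
    """Extract the bit field word[low + width - 1 : low]."""
    return (word >> low) & ((1 << width) - 1)


# Hypothetical format: opcode[15:12] rd[11:8] ra[7:4] rb[3:0]
def do_add(state: State, word: int) -> None:
    ra, rb = state[f"r{field(word, 4, 4)}"], state[f"r{field(word, 0, 4)}"]
    state[f"r{field(word, 8, 4)}"] = (ra + rb) & 0xFFFF


def do_sub(state: State, word: int) -> None:
    ra, rb = state[f"r{field(word, 4, 4)}"], state[f"r{field(word, 0, 4)}"]
    state[f"r{field(word, 8, 4)}"] = (ra - rb) & 0xFFFF


@dataclass
class Pattern:
    name: str
    mask: int   # bits that identify the instruction
    value: int  # required value of those bits
    execute: Callable[[State, int], None]  # semantic action


PATTERNS = [
    Pattern("add", 0xF000, 0x1000, do_add),
    Pattern("sub", 0xF000, 0x2000, do_sub),
]


def step(state: State, word: int) -> str:
    """Decode one instruction word, run its action, return its name."""
    for pattern in PATTERNS:
        if (word & pattern.mask) == pattern.value:
            pattern.execute(state, word)
            return pattern.name
    raise ValueError(f"illegal instruction {word:#06x}")
```

As with the generated decoders described above, attaching the semantic action to the pattern is what turns the decoder into a simulator.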
2.2 Usual processor descriptions
Interestingly, the official descriptions of the most important processor architectures use no formally defined languages like those presented above. I have found no publications related to the development and testing methods these companies use, or in what languages they describe the different levels of the chips they design. They certainly have internal simulators to perform validation and they sometimes seem to generate from them the pseudo-code included in their documentation.
As far as the encoding is concerned, all of them present tables to specify opcode bit fields and values. For the instructions' operations, Sparc simply provides an English description; IA-32 instructions are described in a non-executed pseudo-code, whereas Alpha and PPC present pseudo-code that has been executed; the IA-64 description uses an extended C language. It can probably be argued that the description presented in the manuals is targeted toward human readability rather than formalism, and thus pseudo-codes are chosen rather than new, formally correct languages. In the same way, tables are more readable for encoding than bit field definitions and matching. In all the architectures, the processor structure itself is always described in English: the floating-point unit and the memory management unit, to give a few examples, are never provided in a formal language.
Comparing real hardware description languages to the documentation of the IA-64 specification highlights the lack of formal information provided by the latter, especially concerning the processor structure and internal mechanisms. They will have to be programmed in the reference simulator based on their English description, which is less than ideal for correctness purposes. In the same way, generating tests from this documentation may prove difficult without providing the test generator with additional structure information.
A more in-depth study of the IA-64 code will also show that the level of abstraction chosen for the pseudo-code is variable, in that internal mechanisms are sometimes made explicit although they are hidden the rest of the time without strong reasons to do so.
Chapter 3
IA-64 Description
The IA-64 is an entirely new architecture developed during the past decade by Intel and Hewlett-Packard. It is a complex design with several advanced features like full predication, speculation and explicit instruction parallelism. The first processor implementing the IA-64 specification is the Intel Itanium, which has been available since May 2001.
3.1 Instruction encoding
3.1.1 Instruction flow
In the IA-64 architecture, the instruction flow follows stricter rules than those usually imposed in other similar processors. The flow is divided into groups of successive instructions. All the instructions inside a given group are independent from each other, so that their order of execution does not matter: they can all be executed in parallel or reorganized to make the best use of the processor's resources. The allocation of the groups is done at compile time, and the stops that mark the end of a group and the beginning of a new one are encoded directly into the instruction flow (see footnote 1). When a stop is found, all the pending results from the current group are committed to the registers and the execution can continue to the next one. The dependencies that may not occur between instructions in the same group are rather simple in the general case but also include some subtle and complex rules:

read-after-write (RAW) and write-after-write (WAW) register dependencies are not allowed within an instruction group.
write-after-read (WAR) is allowed except in some special cases. This includes explicit register access (registers encoded into the instruction) and implicit register access (control registers accessed by some instructions as a side-effect).
RAW, WAW and WAR dependencies are allowed for memory accesses. A load from a previously written address within the instruction group will return the latest written value.
1 Some instructions (some branches and a return from interruption for example) can dynamically end a group even if no stop is present.
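The general-case register rules above lend themselves to a simple check over one instruction group. The sketch below is a simplification under stated assumptions: each instruction is reduced to a hypothetical pair of register-name sets (reads, writes), and the special cases mentioned in the text (WAR exceptions, implicit accesses with their own rules, memory accesses) are deliberately not modelled.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical instruction summary: (registers read, registers written).
Instr = Tuple[Set[str], Set[str]]


def group_violations(group: Iterable[Instr]) -> List[str]:
    """List the RAW and WAW register dependencies found inside one group."""
    written: Set[str] = set()  # registers written earlier in the group
    problems: List[str] = []
    for index, (reads, writes) in enumerate(group):
        for reg in sorted(reads & written):
            problems.append(f"RAW on {reg} at instruction {index}")
        for reg in sorted(writes & written):
            problems.append(f"WAW on {reg} at instruction {index}")
        # WAR is allowed in the general case, so reads are not tracked.
        written |= writes
    return problems
```

A test generator could use a check of this kind to decide where a stop must be inserted before emitting the next instruction.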
Some specific instructions have different requirements in position (first or last in a group) and in dependencies. [IA64-v1] in section 3.4 provides more detailed information.
3.1.2 Encoding
Instructions are encoded by three into a bundle of 128 bits. Each bundle is divided into a template of 5 bits and three instruction slots of 41 bits (see figure 3.1).

Bits      127-87              86-46               45-5                4-0
Field     instruction slot 2  instruction slot 1  instruction slot 0  template
Width     41                  41                  41                  5

Figure 3.1: IA-64 Instruction Encoding Format (from [IA64-v1], Figure 3-15)

The template provides two pieces of information:
The slot mapping of the instructions, that is, on which type of unit they will be executed.
The instruction group stops.
Table 3.1 provides a part of the possible template encodings as examples. A complete reference can be found in [IA64-v1] in section 3.3. A double vertical line (||) indicates a group stop between two instructions. For example, the template 03 has a stop after instruction slot 1 and instruction slot 2, while the template 00 does not stop the current instruction group.

Template  Slot 0  Slot 1     Slot 2
00        M-unit  I-unit     I-unit
01        M-unit  I-unit     I-unit ||
02        M-unit  I-unit ||  I-unit
03        M-unit  I-unit ||  I-unit ||
...
1C        M-unit  F-unit     B-unit
1D        M-unit  F-unit     B-unit ||
...

Table 3.1: Template Field Encoding and Instruction Slot Mapping (from [IA64-v1], Table 3-10)

Table 3.2 shows the different types of instructions and on which execution unit they are scheduled. The L+X type is a special type for some instructions that are encoded in 82 bits instead of 41. These are scheduled on an I-unit (for example 64-bit immediate loading into a register) except for the long branch, which is executed on a B-unit.

Instruction type  Description      Execution unit type
A                 Integer ALU      I-unit or M-unit
I                 Non-ALU integer  I-unit
M                 Memory           M-unit
F                 Floating-point   F-unit
B                 Branch           B-unit
L+X               Extended         I-unit/B-unit

Table 3.2: Relationship between Instruction Type and Execution Unit Type (from [IA64-v1], Table 3-9)

It is important to mention that not all combinations of unit types are available as templates, and only the most common have been provided (24 in the current architecture). The compiler will add no-operations to complete the bundles if needed. Note as well that bundles and groups are completely independent concepts. The Itanium processor, for example, can load two bundles simultaneously and execute them in parallel if no stop occurs and enough units are available. The use of bundles however implies that jumps can only target the beginning of a bundle, since there is no way to indicate which slot should be selected.
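The bundle layout of figure 3.1 (template in bits 4-0, slots 0, 1 and 2 in bits 45-5, 86-46 and 127-87) can be unpacked with plain shifts and masks. The following sketch treats a bundle as a 128-bit Python integer; the function names are illustrative and not taken from any existing decoder.

```python
from typing import Tuple

SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1


def decode_bundle(bundle: int) -> Tuple[int, Tuple[int, int, int]]:
    """Split a 128-bit bundle into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot0 = (bundle >> TEMPLATE_BITS) & SLOT_MASK                    # bits 45-5
    slot1 = (bundle >> (TEMPLATE_BITS + SLOT_BITS)) & SLOT_MASK      # bits 86-46
    slot2 = (bundle >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & SLOT_MASK  # bits 127-87
    return template, (slot0, slot1, slot2)


def encode_bundle(template: int, slots: Tuple[int, int, int]) -> int:
    """Inverse of decode_bundle, handy for building test bundles."""
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for index, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle
```

Since 5 + 3 x 41 = 128, the three slots and the template fill the bundle exactly, which a round trip through encode_bundle and decode_bundle confirms.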
3.2 Register files
The IA-64 architecture defines the following registers (see also figure 3.2):
General registers (GR): a bank of 128 64-bit registers provided for integer and multimedia computations and available at all privilege levels. Each of these registers has an additional bit called the Not a Thing bit (NaT) for speculation purposes (see section 3.5). GR0 is always 0 and is read-only. GR0 to GR31 are called the static general registers. GR32 to GR127 are called the stacked general registers, and are available to a program by allocating a stack frame of local and output registers, in the same way that a procedure allocates stack space in memory for its local variables. The mechanism of the register stack engine is described later (section 3.6). A portion of the stacked general registers can be programmed to rotate for some specific optimized loops (see section 3.7). A special bank of 16 registers can replace GR16 to GR31 to quickly provide scratch registers for code like interrupt handlers, and thus avoid saving and restoring registers. This bank switch can only be performed by system-level code.
Floating-point registers (FR): a bank of 128 82-bit floating-point registers. The floating-point formats and operations are described in section 3.8. FR0 and FR1 are read-only and contain +0.0 and +1.0 respectively. All the other registers are freely available for applications. As for the stacked general registers, FR32 to FR127 can be programmed to rotate during some specific loops.
Predicate registers (PR): a bank of 64 1-bit registers that are used in predication (see section 3.4). PR0 is always 1 (true) and ignores all writes. PR16 to PR63 can be programmed to rotate in some specific loops.
Branch registers (BR): a bank of 8 64-bit registers used to specify addresses for indirect branches. Among other things, they are used to store the return address when calling a procedure.
Instruction Pointer (IP): a 64-bit register containing the address (virtual or physical, according to the processor mode) of the bundle currently being executed. IP is always 16-byte aligned (128 bits).
Current Frame Marker (CFM): this register holds the state of the current general register frame, that is, it keeps track of the last allocation that was done in the stacked general registers. It also keeps track of the rotating state of the general, floating-point and predicate registers.
Application Registers (AR): a set of 128 64-bit registers, although most of them are reserved. They contain control information available at application level, like the register stack engine configuration, the floating-point status register and the loop counters.
Advanced Load Address Table (ALAT): an internal table used for data speculation (see section 3.5).
Processor identification registers (CPUID): a set of several registers describing the processor manufacturer, the version of the architecture and the processor-specific version number. It also provides information about the optional features implemented.
Performance monitor data/configuration registers (PMD/PMC): these two sets of registers are used to accumulate performance information during the processor execution. The configuration registers are only available to privileged programs and can be set to count various types of events. This information is collected into the data registers, which can possibly be accessed by application-level programs. See section 3.12 for more information.
Region Registers (RR): a set of registers used in virtual memory addressing (see section 3.10).
Protection Key registers (PKR): these registers contain keys used to protect memory areas in virtual memory addressing (see section 3.10).
Translation Look-aside Buffers (TLB): internal tables described in section 3.10. They contain the page translation lists for virtual memory management.
Processor Status Register (PSR): this register contains the main options and status bits controlling the functions of the processor, such as endianness and virtual memory options. The first 6 bits of the PSR form the User Mask and are available to application-level programs.
Breakpoint registers (IBR/DBR): these registers provide hardware breakpoints for instructions and memory accesses.
Control Registers (CR): a bank of 128 64-bit registers that controls system-level configuration and information for exceptions and external interrupts. Most of them are reserved. They are only accessible to system-level programs.

Figure 3.2: System register model (from [IA64-v2], Figure 3-1)
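A few of the register rules listed above (GR0 fixed at 0, a NaT bit attached to every general register, PR0 fixed at 1 and ignoring writes) can be captured in a small model. This is a sketch only: the class and method names are hypothetical, and raising a Python exception on a write to GR0 is an assumption, since the text only states that GR0 is read-only.

```python
from typing import Tuple

MASK64 = (1 << 64) - 1


class GeneralRegisters:
    """128 64-bit registers, each with a NaT bit; GR0 always reads as 0."""

    def __init__(self) -> None:
        self.values = [0] * 128
        self.nat = [False] * 128

    def read(self, index: int) -> Tuple[int, bool]:
        return self.values[index], self.nat[index]

    def write(self, index: int, value: int, nat: bool = False) -> None:
        if index == 0:
            # Assumption: a write to the read-only GR0 is reported as an error.
            raise ValueError("GR0 is read-only")
        self.values[index] = value & MASK64
        self.nat[index] = nat


class PredicateRegisters:
    """64 1-bit registers; PR0 is always 1 and ignores all writes."""

    def __init__(self) -> None:
        self.bits = [0] * 64
        self.bits[0] = 1

    def read(self, index: int) -> int:
        return self.bits[index]

    def write(self, index: int, bit: int) -> None:
        if index != 0:  # writes to PR0 are silently dropped
            self.bits[index] = bit & 1
```

Stacking, rotation and the banked GR16 to GR31 registers are left out of this model on purpose.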
3.3 Integer computations
The integer execution units provide several types of instructions:
Arithmetic instructions (addition, subtraction, shift left and add).
Logical instructions (and, or, and complement, xor).
32-bit integer operations (add pointer, shift left and add pointer, sign extend, zero extend).
Bit field instructions (signed/unsigned shift right, shift left, extraction, deposit, paired shift right).
Immediate value loading (short move (22 bits) and long move (64 bits) to register).
3.4 Predication
The IA-64 instruction set is fully predicated: every instruction is encoded with a 6-bit index value referencing a predicate register. If this register is equal to 1 (true), the instruction is executed; if not, except in rare cases, it acts as a no-operation. PR0 is always 1, so instructions predicated by PR0 are always executed. Several types of instructions set predicate registers to 0 or 1 according to their result. They are comparisons, floating-point comparisons, bit testing, NaT bit testing and some specific floating-point instructions. The IA-64 specification defines several ways of setting the predicate registers according to the result to obtain. [IA64-v1] in section 4.3 provides more information on these cases. As an example (from [IA64-v1], section 8.5), consider the following code:

    if (a) {
        b = c + d;
    }
    if (e) {
        h = i - j;
    }

This code can be compiled into IA-64 assembly without any branches:

    cmp.ne p1,p2=a,r0      // p1 takes the value of (a != 0)
                           // p2 takes the opposite value
    cmp.ne p3,p4=e,r0;;    // p3 takes the value of (e != 0)
                           // p4 takes the opposite value
                           // ;; ends the group
    (p1) add b=c,d         // if (p1) then add
    (p3) sub h=i,j         // if (p3) then sub

3.5 Speculation
Speculation allows the compiler to load data in advance and thus reduce the memory latencies usually introduced by these operations. Two types of speculation are provided.
3.5.1 Control speculation
In control speculation, the processor performs the memory load in advance but defers all exceptions encountered (that would prevent it from loading the value and call the system to know what to do) and instead sets the NaT bit of the target register to 1 to indicate that its value is not valid. This NaT bit propagates in all computations to ensure that no wrong result is computed from an invalid load. For the floating-point registers, a special NaTVal encoding is used instead of a separate NaT bit. When the value itself or a subsequent result is to be used, it can be checked by a chk instruction that will branch to recovery code if the value is invalid, or simply continue if it is valid. This system allows loads and their dependent uses to be safely moved above branches, as in the following example (from [IA64-v1] section 8.4.3):
    (p1) br.cond.dptk L1       // Cycle 0
         ld8 r3 = [r5] ;;      // Cycle 1
         shr r7 = r3,r87       // Cycle 3
If we suppose the load has a latency of 2 cycles, the shift right will stall waiting for the result. Using control speculation, the code can be rewritten as:
         ld8.s r3 = [r5]           // Earlier cycle - speculative load
         // other instructions
    (p1) br.cond.dptk L1;;         // Cycle 0
         chk.s r3, recovery        // Cycle 1 - checking the load
         shr r7 = r3,r87           // Cycle 1
If an exception is raised during the load, it is deferred and the NaT bit for r3 is set to 1. If p1 is true, the conditional branch is taken and the loaded value is never used, so everything runs as if the load had never been performed. If p1 is false, a check occurs on r3. If the NaT bit is set, the processor jumps to the recovery code to get the correct value (and lets the system handle the exception). If not, the load has already been performed, so the processor does not stall waiting for the memory. This code assumes that r5 is ready at an earlier point and that there are some instructions to insert between the speculative load and the conditional branch.
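The deferral and propagation of the NaT bit described above can be sketched in a few lines of Python. This is only an illustrative model, not the simulator's code: the helper names (`ld_s`, `shr`, `chk_s`, `speculated_shift`) and the dictionary used as memory are assumptions made for the example, and an unmapped address stands in for any exception that the load would raise.

```python
def ld_s(memory, addr):
    """Speculative load: a failing access is deferred by returning NaT = True."""
    if addr in memory:
        return memory[addr], False
    return 0, True  # the value is meaningless, the NaT bit marks it invalid


def shr(value, nat, count):
    """Shift right; the NaT bit of the source propagates to the result."""
    return value >> count, nat


def chk_s(nat):
    """Return True when execution must branch to the recovery code."""
    return nat


def speculated_shift(memory, addr, count):
    """Model of the rewritten sequence: ld8.s, then chk.s, then shr."""
    value, nat = ld_s(memory, addr)          # load hoisted above the branch
    if chk_s(nat):
        # Recovery code: redo the load non-speculatively, so that a real
        # fault (here a KeyError) surfaces at this point.
        value, nat = memory[addr], False
    return shr(value, nat, count)[0]
```

In this model a deferred exception costs nothing unless the value is actually consumed, which is the property the chk.s instruction relies on.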
3.5.2
Data speculation
Data speculation allows the compiler to schedule loads in advance even though some conflicting memory writes may occur before the value is used. The processor keeps a table of the advanced loads (the ALAT or Advanced Load Address Table) which is updated every time a store is performed. If the store invalidates a previous advanced load (by writing to an overlapping memory area), the load entry disappears, and when the load value is checked, the processor reissues the load as if nothing had been done. If the load entry is still present, the value is still valid and can be used directly. This is illustrated by the following code (from [IA64-v1] section 8.4.4). The compiler generates a write to memory, and it does not know whether the memory area written will overlap with the memory read by the next load:
    st8 [r55] = r45       // Cycle 0 - write
    ld8 r3 = [r5];;       // Cycle 0 - read
    shr r7 = r3,r87       // Cycle 2 - operation
However, it can use an advanced load, thus producing the following code:
    ld8.a r3 = [r5];;     // Earlier cycle - advanced load
    // Other instructions
    st8 [r55] = r45       // Cycle 0
    ld8.c r3 = [r5]       // Cycle 0 - checking the load
    shr r7 = r3,r87       // Cycle 0
If the store has written to a memory area that is different from the load memory area, the entry for the load in the ALAT will still be present and the check will simply confirm that the value is valid. If the store has overwritten a part of the memory area read by the advanced load, the load entry will have disappeared and the processor will re-issue the load, thus stalling for some cycles. Such an optimization is of course to be used when the compiler can compute that there is a high probability that the load and the store won't overlap.
It is possible to combine advanced and speculative loads. It is also possible to check an advanced load with the chk instruction to branch to recovery code instead of simply reloading the value. Note that, contrary to control speculation, where the NaT state is propagated, the ALAT keeps an entry that depends on the register used as target for the load, so the check can only be performed on the original target register.
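As a rough illustration of the ALAT behavior described above, the following Python sketch keeps one entry per target register and drops it when a store overlaps the loaded area. The class and method names are hypothetical, and a real ALAT has a limited capacity and implementation-dependent matching that are not modeled here.

```python
class ALAT:
    """Minimal model: one entry per target register of an advanced load."""

    def __init__(self):
        self.entries = {}  # target register name -> (address, size in bytes)

    def advanced_load(self, reg, addr, size):
        """ld.a: remember which memory area was read into `reg`."""
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        """Any store removes the entries whose area it overlaps."""
        for reg, (start, length) in list(self.entries.items()):
            if start < addr + size and addr < start + length:
                del self.entries[reg]

    def check(self, reg):
        """ld.c: True if the entry survived, False if a reload is needed."""
        return reg in self.entries
```

The check is keyed by the register name only, which mirrors the remark that the check can only be performed on the original target register.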
3.6
Register stack
The IA-64 architecture provides a complex mechanism to allow procedures to call other subroutines without having to explicitly save the registers they use, and even to pass parameters and results directly to each other. A part of the general registers called the stacked general registers (from GR32 up to GR127) is used as a stack. When a procedure is called, it automatically inherits some stack registers from its parent (they are called input registers). It can also allocate more registers on the stack with the alloc instruction. These new registers are divided between the local registers, which are available only to the current procedure, and the output registers, which will become input registers for any child procedure it calls. Figure 3.3 illustrates this mechanism. With this system of overlapping input/output registers, parameters and results can be passed directly without extra copying. This process is invisible to the procedures themselves since the first current register on the stack is always GR32 and the following registers are renamed accordingly.
The Register Stack Engine (RSE) is responsible for spilling and filling the stacked general registers according to the allocation demands. When a procedure asks for more registers than are physically available, the RSE spills some or all of the old registers to a program-specific memory area called the backing store. When procedures return and stack frames are restored, the RSE ensures that the allocated registers are physically present, and if not it loads them from the backing store. Note that the RSE keeps track of the NaT bits as well. Figure 3.4 illustrates this mechanism.
The RSE handles spills and fills independently from the program and the processor, so its functioning is invisible to the programmer. It can be programmed to anticipate the instruction flow by spilling or restoring more registers than necessary. In this mode, it does not raise any exception for non-mandatory loads and stores.
Figure 3.3: Register stack behavior on procedure call and return (from [IA64-v1] Figure 4-1)
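The renaming of overlapping frames can be sketched as follows. This is a simplified Python model under stated assumptions: the class and its fields (`base`, `sof`, `sol`) are illustrative names, the physical file is assumed large enough, and the spilling and filling done by the RSE is deliberately left out.

```python
class RegisterStack:
    """Stacked registers GR32.. seen through a moving frame base."""

    def __init__(self, n_physical=96):
        self.phys = [0] * n_physical
        self.base = 0      # physical index currently named GR32
        self.sof = 0       # size of frame (inputs + locals + outputs)
        self.sol = 0       # size of locals (inputs + locals)
        self.saved = []    # caller frames, restored on return

    def alloc(self, inputs, local, outputs):
        self.sol = inputs + local
        self.sof = self.sol + outputs

    def _index(self, n):
        if not 32 <= n < 32 + self.sof:
            raise IndexError("GR%d is outside the current frame" % n)
        return (self.base + n - 32) % len(self.phys)

    def read(self, n):
        return self.phys[self._index(n)]

    def write(self, n, value):
        self.phys[self._index(n)] = value

    def call(self):
        """The caller's output registers become the callee's frame."""
        self.saved.append((self.base, self.sof, self.sol))
        self.base += self.sol
        self.sof -= self.sol
        self.sol = 0

    def ret(self):
        self.base, self.sof, self.sol = self.saved.pop()
```

For example, a value written by the caller to its first output register is read by the callee as GR32, and a result written there by the callee is visible to the caller after the return, without any copy.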
3.7
Register rotation
The IA-64 supports a feature called rotating registers to improve the speed of loops by completely filling the processor pipeline. Consider a simple loop (from [IA64-v1] section 11):
    L1: ld4 r4 = [r5],4       // Cycle 0 - 2 cycles latency
        add r7 = r4,r9        // Cycle 2
        st4 [r6] = r7, 4      // Cycle 3
        br.cloop L1           // Cycle 3
The successive memory accesses and the register dependencies introduce latencies, thus preventing the processor from parallelizing instructions in the core of the loop and limiting the speed to four cycles per iteration. One solution consists in performing several iterations of the loop at the same time to improve the use of the memory units. This is called loop unrolling and would give the following code (with two iterations at the same time):
        // setup the address registers
        add r15 = 4,r5
        add r16 = 4,r6
    L1: ld4 r4 = [r5],8       // Cycle 0
        ld4 r14 = [r15],8     // Cycle 0
        add r7 = r4,r9        // Cycle 2
        add r17 = r14,r9      // Cycle 2
        st4 [r6] = r7, 8      // Cycle 3
        st4 [r16] = r17, 8    // Cycle 3
        br.cloop L1           // Cycle 3
Figure 3.4: Relationship between physical registers and backing store (from [IA64-v2] Figure 6-1)
Note that the loop uses two different address registers to remove the dependencies introduced by the post-increment operation. This allows the processor to perform two iterations of the loop in 4 cycles. This result can be improved by unrolling four iterations of the loop, which would make use of cycle 1, left unused in the two-iteration version, and would run four iterations in five cycles (see [IA64-v1] section 11.3.1).
The IA-64 architecture provides a way of doing software pipelining of loops without any code expansion, through a set of rotating registers. Consider the following loop:
        mov lc = 199                 // LC = loop count - 1
        mov ec = 4                   // EC = epilog stages + 1
        mov pr.rot = 1<<16           // PR16 = 1, rest = 0
    L1: (p16) ld4 r32 = [r5], 4      // Cycle 0
        (p18) add r35 = r34, r9      // Cycle 0
        (p19) st4 [r6] = r36, 4      // Cycle 0
        br.ctop L1                   // Cycle 0
The goal is to let the processor run iterations in parallel with the help of the predicate registers. At each execution of the br.ctop instruction, the general and predicate registers are rotated so that their index number is increased by 1. The execution of the loop gives the trace shown in table 3.3.

            Instructions/Units            State before br.ctop
    Cycle   M     I     M     B           p16  p17  p18  p19  LC   EC
    0       ld4               br.ctop     1    0    0    0    199  4
    1       ld4               br.ctop     1    1    0    0    198  4
    2       ld4   add         br.ctop     1    1    1    0    197  4
    3       ld4   add   st4   br.ctop     1    1    1    1    196  4
    ...     ...   ...   ...   ...         ...  ...  ...  ...  ...  ...
    100     ld4   add   st4   br.ctop     1    1    1    1    99   4
    ...     ...   ...   ...   ...         ...  ...  ...  ...  ...  ...
    199     ld4   add   st4   br.ctop     1    1    1    1    0    4
    200           add   st4   br.ctop     0    1    1    1    0    3
    201           add   st4   br.ctop     0    0    1    1    0    2
    202                 st4   br.ctop     0    0    0    1    0    1
    ...                                   0    0    0    0    0    0

Table 3.3: Instruction trace for the rotating register loop.
At cycle 0, only PR16 is equal to 1, so ld4 is the only instruction executed. At cycle 1, the registers have been rotated, so GR32 has become GR33 and PR16 has become PR17. The br.ctop has set the new PR16 to 1, so the ld4 is executed with the new GR32 as target. At cycle 2, PR16, PR17 and PR18 are set, so the add is executed as well. It reads GR34, which is the first GR32 after two rotations, and writes its result to GR35. At cycle 3, this result has rotated into GR36 and is written to memory by the st4. The loop is now fully pipelined and the processor performs one stage of each of four iterations in every cycle, which is equivalent to one iteration per cycle. The epilogue count is used to get out of the loop: the different predicate registers are progressively switched to 0 to stop the parallel operations.
The number of registers that rotate is configurable for the general registers (a part of the stacked general registers), and fixed for the predicate and floating-point registers (PR16 to PR63 and FR32 to FR127). The current rotating values are stored in the CFM register. The architecture also provides more features like while loops, explicit prologue and epilogue, multiple-exit loops, etc. (see [IA64-v1] section 11).
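The trace of table 3.3 can be reproduced with a small Python model of the loop. This is a sketch under simplifying assumptions: the function name, the size of the rotating general register area (8) and the use of Python lists for memory are choices made for the example, and only the three br.ctop cases needed by this loop are modeled.

```python
def run_ctop_loop(src, r9, rotating_grs=8):
    """Compute dst[i] = src[i] + r9 with the rotating-register loop."""
    if not src:
        raise ValueError("the loop body runs at least once")
    gr = [0] * rotating_grs        # gr[i] models GR(32 + i)
    pr = [1] + [0] * 47            # pr[i] models PR(16 + i); mov pr.rot = 1<<16
    lc, ec = len(src) - 1, 4       # mov lc = n - 1; mov ec = 4
    next_load, dst, trace = 0, [], []
    while True:
        trace.append((pr[0], pr[1], pr[2], pr[3], lc, ec))
        # One instruction group: all sources are read before targets are written.
        loaded = src[next_load] if pr[0] else None   # (p16) ld4 r32 = [r5], 4
        summed = gr[2] + r9 if pr[2] else None       # (p18) add r35 = r34, r9
        if pr[3]:                                    # (p19) st4 [r6] = r36, 4
            dst.append(gr[4])
        if pr[0]:
            gr[0] = loaded
            next_load += 1
        if pr[2]:
            gr[3] = summed
        # br.ctop: update LC/EC and choose the value of the new PR16.
        if lc > 0:
            lc, new_p16, taken = lc - 1, 1, True
        elif ec > 1:
            ec, new_p16, taken = ec - 1, 0, True
        else:
            ec, new_p16, taken = 0, 0, False
        # Rotation: register number X becomes X + 1.
        gr = [gr[-1]] + gr[:-1]
        pr = [new_p16] + pr[:-1]
        if not taken:
            return dst, trace
```

With 200 input values, the returned trace holds one `(p16, p17, p18, p19, LC, EC)` tuple per cycle, from cycle 0 to cycle 202.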
3.8
Floating-point architecture
The IA-64 defines a new floating-point encoding fully compliant with the IEEE-754 standard [IEEE-std]. The format of a floating-point register is described in figure 3.5.

    81    80          64 63                      0
    sign |   exponent   |       significand
      1         17                  64

Figure 3.5: Floating-point register format (from [IA64-v1], Figure 5-1)
    Class or subclass         Sign   Biased exponent            Significand (i.bb...bb)
    NaNs                      0/1    0x1FFFF                    1.000...01 through 1.111...11
      Quiet NaNs              0/1    0x1FFFF                    1.100...00 through 1.111...11
      Signaling NaNs          0/1    0x1FFFF                    1.000...01 through 1.011...11
    Infinity                  0/1    0x1FFFF                    1.000...00
    Normalized numbers        0/1    0x00001 through 0x1FFFE    1.000...00 through 1.111...11
    Integer or parallel FP    0      0x1003E                    1.000...00 through 1.111...11
    Unnormalized numbers      0/1    0x00000                    0.000...01 through 1.111...11
                              0/1    0x00001 through 0x1FFFE    0.000...01 through 1.111...11
                              0/1    0x00001 through 0x1FFFD    0.000...00
                              1      0x1FFFE                    0.000...00
    Integer or parallel FP    0      0x1003E                    0.000...00 through 0.111...11
    Pseudo-zeroes             0/1    0x00001 through 0x1FFFD    0.000...00
                              1      0x1FFFE                    0.000...00
    NaTVal                    0      0x1FFFE                    0.000...00
    Zero                      0/1    0x00000                    0.000...00
    FR0                       0      0x00000                    0.000...00
    FR1                       0      0x0FFFF                    1.000...00
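Table 3.4: Floating-point register encodings (from [IA64-v1] Table 5-2)
The value of a finite floating-point number with a non-zero exponent field can be computed as follows:
\[ (-1)^{\mathrm{sign}} \times 2^{\mathrm{exponent} - 65535} \times \mathrm{significand}\{63\}.\mathrm{significand}\{62{:}0\} \]
If the exponent is zero, the following formula applies:
\[ (-1)^{\mathrm{sign}} \times 2^{\mathrm{exponent} - 16382} \times \mathrm{significand}\{63\}.\mathrm{significand}\{62{:}0\} \]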
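The two cases above can be checked with exact rational arithmetic. The following Python sketch is an illustration only: the function name `fr_value` is hypothetical, and the special encodings of table 3.4 that use exponent 0x1FFFF (NaNs, infinities) are rejected rather than handled.

```python
from fractions import Fraction


def fr_value(sign, exponent, significand):
    """Exact value of a finite floating-point register content.

    sign is 0 or 1, exponent is the 17-bit biased exponent and
    significand is the 64-bit field whose bit 63 is the explicit integer bit.
    """
    if exponent == 0x1FFFF:
        raise ValueError("NaN and infinity encodings have no finite value")
    # significand{63}.significand{62:0} is a binary fixed-point number
    # with 63 fractional bits.
    mantissa = Fraction(significand, 1 << 63)
    power = exponent - 65535 if exponent != 0 else -16382
    return (-1) ** sign * Fraction(2) ** power * mantissa
```

Applied to the FR0 and FR1 rows of table 3.4, the function gives 0 and 1 respectively.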
The implicit bit is always present in the register format, as is usual for the double-extended formats defined by the IEEE standard. There are two special ways of using the floating-point registers:
as an integer value, where the significand contains the 64-bit integer.
as two single floating-point values of 32 bits each, packed into the significand. This is called parallel floating-point mode.
Table 3.4 provides a list of the main encodings that can be used. The NaTVal value is used when doing speculation and is propagated in the same way as the NaT bit for general registers (see section 3.5). All register encodings are allowed as inputs to arithmetic operations. However, some encodings can never be generated as a result. Section 5.1 in [IA64-v1] provides a complete description of all cases.
The floating-point execution units can provide results in all the standard IEEE 754 formats, that is with 24, 53 or 64 bits of precision for the significand, and 8, 11, 15 or 17 bits of precision for the exponent. The different precisions can be combined freely. The floating-point status register reports all the faults and traps defined by the standard that occur during the computation. It also allows some faults or traps to be ignored and simply reported instead of generating a call to the fault handler.
The IA-64 architecture describes a special mechanism that allows the processor to generate a software assistance fault when needed. The Itanium processor uses this possibility to handle unnormalized numbers in software rather than in hardware and thus reduces the complexity of its floating-point computation units. This software handling is invisible to programs.
The following floating-point operations are defined:
Memory access for various formats including parallel-fp (load and store with memory/register format conversion).
Transfer between general and floating-point registers.
Arithmetic operations (fp multiply and add, fp reciprocal approximation, fp reciprocal square root approximation, fp compare, fp min/max, conversion from/to integer). These operations are also available for parallel-fp mode.
Non-arithmetic instructions (classify, merge, mix, sign extend, pack, swap, binary operations, select).
Status register instructions (check, set, clear).
Integer multiply and add instruction.
3.9
Multimedia support
The multimedia instructions treat the content of the general registers as concatenations of eight 8-bit, four 16-bit or two 32-bit elements that are manipulated independently. The instruction set covers parallel addition, subtraction, average, average of a difference, shift left/right and add, compare, multiply, multiply and shift right, sum of absolute differences and min/max. Three modes of computation exist: modulo (results are wrapped when they overflow), signed saturation and unsigned saturation. Section 4.6 in [IA64-v1] describes these instructions more precisely.
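As an illustration of the three computation modes, here is a Python sketch of a parallel addition on a 64-bit register value. The function name `parallel_add` and its `mode` strings are assumptions made for the example; the real instructions encode the mode and the signedness of each operand in their completers, which is not modeled here.

```python
def parallel_add(a, b, width=8, mode="modulo"):
    """Add two 64-bit values lane by lane (lanes of 8, 16 or 32 bits)."""
    if width not in (8, 16, 32):
        raise ValueError("lanes are 8, 16 or 32 bits wide")
    mask = (1 << width) - 1

    def to_signed(x):
        return x - (1 << width) if x >> (width - 1) else x

    result = 0
    for lane in range(64 // width):
        shift = lane * width
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        if mode == "modulo":
            total = (x + y) & mask                      # wrap on overflow
        elif mode == "unsigned":
            total = min(x + y, mask)                    # clamp to the maximum
        elif mode == "signed":
            low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
            total = max(low, min(to_signed(x) + to_signed(y), high)) & mask
        else:
            raise ValueError("unknown mode: " + mode)
        result |= total << shift
    return result
```

No carry ever crosses a lane boundary, which is what distinguishes these operations from a plain 64-bit addition.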
3.10
Memory organization
This section is a short introduction to the memory management in the IA-64 architecture. For a complete reference, see [IA64-v2] chapter 4.
3.10.1
Virtual addressing and memory protection
For IA-64 programs, the virtual addressing model is fundamentally a 64-bit flat linear virtual address space. A 64-bit pointer is divided as shown in figure 3.6. The 3 highest bits are used as an index to select a region register. This register contains a region identifier of 24 bits, as well as some other parameters. The rest of the bits represent the virtual address within the region. When the processor translates a virtual address to its physical equivalent, it proceeds in several steps (see figure 3.7):
    63      61 60                               0
      region  |         virtual address
        3                     61
Figure 3.6: A 64-bit pointer
Figure 3.7: Conceptual virtual address translation (from [IA64-v2] Figure 4-2)

The three highest bits of the pointer are extracted to select the region register.
The region register is read to get the region ID.
The pair formed by the region ID and the virtual address is searched in the TLB for a matching translation.
If a translation is found, the access rights are checked for the page found.
If the access is allowed, the physical address is computed from the physical page number and the lowest bits of the virtual address (offset) and used by the current instruction.
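The steps above can be sketched as a single Python function. This is a conceptual model only: the function name, the fixed 16 KB page size, the dictionary used as TLB and the access rights given as a string of letters are all assumptions made for the example, and faults are returned as status strings instead of being raised.

```python
PAGE_BITS = 14  # assumption: every page is 16 KB


def translate(vaddr, region_regs, tlb, access):
    """Translate a 64-bit virtual address.

    region_regs is a list of 8 region IDs, tlb maps (region ID, virtual
    page number) to (physical page number, rights) and access is one of
    "r", "w" or "x". Returns a (status, physical address) pair.
    """
    region_index = vaddr >> 61                      # three highest bits
    region_id = region_regs[region_index]
    offset_in_region = vaddr & ((1 << 61) - 1)
    vpn = offset_in_region >> PAGE_BITS
    entry = tlb.get((region_id, vpn))
    if entry is None:
        return "tlb_miss", None
    ppn, rights = entry
    if access not in rights:
        return "access_denied", None
    page_offset = vaddr & ((1 << PAGE_BITS) - 1)
    return "ok", (ppn << PAGE_BITS) | page_offset
```

Two pointers with the same low 61 bits but different region bits reach different translations as soon as their region registers hold different region IDs.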
Virtual addressing can be enabled separately for instruction, data and RSE accesses. The following sections describe the different mechanisms involved in the translation.
3.10.2
TLB
The IA-64 architecture defines two distinct TLBs, the Instruction TLB for instruction fetches and the Data TLB for data and RSE accesses. These tables are divided into two sub-sections called translation registers and translation cache.
The translation registers are a fixed-size array entirely managed by software. They allow the operating system to keep critical memory translations in the TLBs, like sensitive interruption code or kernel memory areas. An entry can be inserted in a specific slot by the instruction itr. No overlapping translation may be inserted, or the processor behavior may be undefined. Each TLB must provide at least 8 translation register slots.
The translation cache is a dynamic structure controlled by the processor, which is responsible for choosing which page translations to replace or remove according to implementation-dependent algorithms. It is intended to be used for the more common, non-critical virtual memory translations in a multitasking environment. Software can insert entries into the cache but cannot assume anything about how long these entries will last (although some rules exist to prevent the last inserted entry from being removed immediately). Each translation cache must hold at least 1 entry.
Several purge instructions are provided to manage the TLBs. When they are applied to the translation caches, the application is assured that at least the selected pages have been removed, but the purge may remove more entries. Insertion may also cause additional purges by the processor. A TLB entry contains the following information about the page:

the memory attribute field that describes the cacheability and other memory related features or limitations.
the privilege level (0-3).
the access rights description.
the physical address.
the page size.
the protection key.
the virtual address.
the region ID.
several other bits to control if the page is present, if it is dirty, if exceptions can be deferred, etc.
The access rights are controlled in two ways:
The access right field, coupled with the privilege level of the page and the privilege level of the program, indicates which of the operations Read, Write and Execute are authorized on the page. Some special combinations increase the program's own privilege level (for system calls for example).
A second control is done through the protection keys: the protection key of the page is searched for in the protection key registers. If a match is found, the access rights can be further restricted according to the bits that are set in the register.
Page sizes between 4 KB and 256 MB are allowed in the TLBs (and up to 4 GB for purges). Pages should be aligned on their natural boundaries.
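Since overlapping translation registers leave the processor behavior undefined, a reference simulator may prefer to detect them. The following Python sketch shows one possible check, together with the natural alignment rule; the class, its methods and the decision to reject the insertion are assumptions made for the example and are not prescribed by the architecture.

```python
class TranslationRegisters:
    """Fixed array of software-managed translations (8 slots by default)."""

    def __init__(self, slots=8):
        self.slots = [None] * slots  # each entry: (region ID, vaddr, size, paddr)

    def can_insert(self, slot, region_id, vaddr, size, paddr):
        """True if the page is naturally aligned and overlaps no other slot."""
        if vaddr % size or paddr % size:
            return False
        for index, entry in enumerate(self.slots):
            if entry is None or index == slot:
                continue
            rid, start, length, _ = entry
            if rid == region_id and start < vaddr + size and vaddr < start + length:
                return False
        return True

    def insert(self, slot, region_id, vaddr, size, paddr):
        """Model of itr: place a translation in a specific slot."""
        if not self.can_insert(slot, region_id, vaddr, size, paddr):
            raise ValueError("misaligned or overlapping translation")
        self.slots[slot] = (region_id, vaddr, size, paddr)
```

Two translations for the same virtual range are accepted when they belong to different regions, since the region ID is part of the match.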
Figure 3.8: Interruption classification (from [IA64-v2] Figure 5-1)
3.10.3
VHPT
The Virtual Hash Page Table (VHPT) is an extension of the TLBs, located in memory and used to improve virtual address translation. An IA-64 processor can optionally implement a hardware page walker for the VHPT, so that if a search in the TLBs is not successful, the processor can look for the translation in the VHPT without invoking any operating system code. The VHPT is located in the virtual memory space. It can be configured as the main OS page table, or as a big cache for the translations. The processor does not update the VHPT, nor does it ensure that the TLBs and the VHPT are consistent. When it finds a translation in the VHPT, a new entry is added in the corresponding translation cache and the execution continues as if the translation had been found in the TLB.
The VHPT can be configured for two formats: a short format for a per-region linear page table where each entry occupies 8 bytes, and a long format for a single large virtual page table where entries are 32 bytes in size. The VHPT functions and configuration system are described in [IA64-v2] sections 4.1.5 to 4.1.7.
3.11
Interruption handling
In an IA-64 system, the interruptions can be divided in two main categories according to the way they are handled:

IVA-based interruptions are serviced by the operating system through the interruption vector in CR2.
PAL-based interruptions are serviced by the Processor Abstraction Layer (PAL) firmware, the system firmware or the operating system.
Interruptions can be divided into four types: Aborts, Interrupts, Faults and Traps. Figure 3.8 presents these types and the way they are handled.
Aborts: An abort is raised when the processor has detected an internal malfunction or a processor reset.
Interrupts: An external device has requested the attention of the processor.
Faults: The current instruction is trying to perform an invalid or unauthorized operation.
Traps: The current instruction has executed correctly but needs the system's attention.
When an interruption occurs, the processor saves the minimum state information to allow the software to handle the event. The saved state is stored in several control registers associated with the interruption management. External interrupt delivery is disabled and the bank of 16 scratch general registers is enabled automatically.
The Interruption Vector Table (IVT) is a 32 KB memory area holding 68 interruption vectors. The first 20 vectors contain 64 bundles each so that performance-critical interruptions can be handled directly. The 48 other vectors provide 16 bundles per vector. Several vectors have more than one interruption associated with them. Sections 5 and 8 of [IA64-v2] provide a more detailed explanation and a list of the interruptions and interruption vectors.
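The size of the IVT follows from these numbers, given that an IA-64 bundle is 128 bits (16 bytes) long. The short Python sketch below works out the arithmetic; the function name is hypothetical, and the assumption that the vectors are laid out contiguously in index order is made for the example.

```python
BUNDLE_BYTES = 16          # one bundle is 128 bits
LARGE_VECTORS = 20         # vectors of 64 bundles
SMALL_VECTORS = 48         # vectors of 16 bundles


def ivt_offset(vector):
    """Byte offset of a vector from the start of the IVT."""
    if not 0 <= vector < LARGE_VECTORS + SMALL_VECTORS:
        raise ValueError("the IVT holds 68 vectors")
    if vector < LARGE_VECTORS:
        return vector * 64 * BUNDLE_BYTES
    large_area = LARGE_VECTORS * 64 * BUNDLE_BYTES
    return large_area + (vector - LARGE_VECTORS) * 16 * BUNDLE_BYTES


# 20 * 64 * 16 + 48 * 16 * 16 = 20480 + 12288 = 32768 bytes, that is 32 KB.
IVT_SIZE = ivt_offset(67) + 16 * BUNDLE_BYTES
```

Under this layout the large vectors are 0x400 bytes apart and the small ones, starting at offset 0x5000, are 0x100 bytes apart.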
3.12
Debugging and performance monitoring
3.12.1
Debugging
Several Data Breakpoint Registers (DBR) and Instruction Breakpoint Registers (IBR) are defined to hold address breakpoint values. Along with them, the IA-64 architecture defines the following facilities:
Break Instruction fault: The break instruction results in a break instruction fault with an immediate value.
Instruction Debug fault: When the processor fetches an instruction at an address that matches the parameters in the IBR, an instruction debug fault is raised.
Data Debug fault: When the processor executes a memory operation on an address matching the parameters in the DBR, a data debug fault is raised.
Single Step trap: After each successful instruction, a single step trap is raised by the processor.
Taken Branch trap: A taken branch trap is raised on every instruction taking a branch.
Lower Privilege Transfer trap: A lower privilege transfer trap is taken if a branch reduces the privilege level.
The IBR/DBR are divided between odd and even registers: odd registers contain the configuration and mask information while even registers contain the address to match. Every even register is associated with the odd register with the next index value (registers 0 and 1, 2 and 3, etc.).
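The pairing of even and odd registers can be illustrated by the following Python sketch. It is a simplification: the odd register is reduced to a plain address mask (the real configuration register also holds enable, privilege and access-type bits), a zero mask is treated as a disabled pair, and the function name is hypothetical.

```python
def breakpoint_hit(addr, regs):
    """Return the index of the first matching register pair, or None.

    regs is laid out as [address 0, mask 0, address 1, mask 1, ...]:
    even entries hold the address to match, odd entries the mask.
    """
    for even in range(0, len(regs) - 1, 2):
        target, mask = regs[even], regs[even + 1]
        if mask and (addr & mask) == (target & mask):
            return even // 2
    return None
```

A mask with its low bits cleared makes the pair match a whole aligned range instead of a single address.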
3.12.2
Performance monitoring
Two banks of registers are provided to count a large number of different events. PMC registers are control registers while PMD registers contain the accumulated data. PMC and PMD registers are associated in pairs (PMC4 with PMD4, etc.). The first four PMC registers are not configuration registers. A counter can report an event through an interruption. When this is the case, or when a counter overflows, the corresponding bit for the counter in the 256 bits of the first four PMC registers is set to 1 to indicate to a program which counter raised the interruption. A large number of events can be monitored. They belong to the following categories:
Basic events, like clock cycles and retired instructions.
Instruction execution (decode, issue, execution, speculation, memory operation).
Cycle accounting events (stall).
Branch events (prediction).
Memory hierarchy (instruction and data caches).
System events (TLBs).
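The mapping between a counter and its status bit in the first four PMC registers can be sketched as follows in Python. The sketch assumes that the 256 bits are numbered in the natural order, with PMC0 holding the bits of counters 0 to 63; the function names are hypothetical.

```python
def overflow_bit_location(counter):
    """Return (PMC register index, bit index) for a counter number."""
    if not 0 <= counter < 256:
        raise ValueError("only 256 status bits are available")
    return counter // 64, counter % 64


def set_overflow(pmc, counter):
    """Set the status bit of `counter` in the list of the first four PMCs."""
    register, bit = overflow_bit_location(counter)
    pmc[register] |= 1 << bit


def overflowed_counters(pmc):
    """List the counters whose status bit is set."""
    return [n for n in range(256) if pmc[n // 64] >> (n % 64) & 1]
```

A program handling the interruption would scan the four registers in this way to find which counters need attention.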
Chapter 4
Extraction of information from the documentation

The IA-64 documentation provided by Intel is encoded as Portable Document Format files [PDF99]. [IA64-v3] is the most important for us as it contains a complete description of the instructions along with pseudo-code and encoding/decoding information. Automatically extracting this information in a usable form would be a huge step forward in generating a simulator as close to the documentation as possible.
4.1
PDF and its shortcomings
PDF is a very useful cross-platform format, very well suited to providing ready-to-print documents. A document encoded in PDF is divided into pages, each of them being built from vector-based descriptions. The basic elements of this description are fonts, characters and paths¹. This makes PDF a format that contains very little semantic information about its content: although it allows a user to search throughout the text, no difference can be made between a drawing and the lines that compose a table, or between a title and an annotation. Moreover, PDF is a very complex format to manipulate. Some tools exist to convert PDF to other formats:

Postscript is a natural target since its description possibilities are a superset of PDF's. However, it is of little interest except for printing purposes, its encoding being even more complex. Several programs can export PDF to Postscript, among others ghostscript [3] and xpdf [4].
The text contained in the PDF files can be extracted in different ways. Some perl scripts simply remove the PDF commands from the file. The most efficient tool is once again xpdf [4], which draws the PDF information on a text page, just as it draws it on a graphics page in the normal case. It usually handles line breaks and spacing pretty well, and provides a reasonably close text version of the PDF document, at least to the human eye.
Some tools try to convert PDF (or Postscript) to another popular cross-platform format, HTML. The inherent lack of structure information makes this transformation quite difficult and often no better than a simple text extraction².
¹ Series of curves and lines that can be drawn and filled with different styles and colors.
² Most of the tools do not even produce any output on the Intel documentation.
The goal of the conversion was to provide a format that is simple to parse with a minimum of human intervention. The text conversion has been found to be the closest to these goals as far as the pseudo-code of the instructions is concerned. An example of the transformation is provided in figures 4.1 and 4.2.
Because PDF does not use the concept of tables, the decoding/encoding information could not be extracted into something usable. The encoding tables are heavily dependent on the position of the bit markers, that is the numbers that mark the size and the position of the different fields. It is impossible to use the text conversion there because it is not precise enough. In the same way, most of the values are centered vertically and horizontally in the table cell they occupy, and cells sometimes span several rows or columns. This makes the structure of a table very difficult to grasp just from the positioning of the values it contains.
The pseudo-code of the add instruction provides a good example of a mistake caused by the conversion from the PDF format. At the end of the code, one can read:
    if (plus1_form)
        GR[r1] = tmp_src + GR[r3] + 1;
    elseGR[r1] = tmp_src + GR[r3];
which should obviously be read as:
    if (plus1_form)
        GR[r1] = tmp_src + GR[r3] + 1;
    else
        GR[r1] = tmp_src + GR[r3];
These problems are relatively few over the whole instruction set, but human intervention is needed in the process.
4.2
Intel description format
The description format used for each IA-64 instruction is the following: after the instruction's name, a series of sections provides different kinds of information. Figure 4.1 is a good, although simple, example of the format.
The format section is composed of lines describing the different variations of the instruction. The assembly syntax is given first, then flags that make explicit the particularities of the current variation. These flags are used as boolean values in the pseudo-code to decide which operations to perform. Instead of these flags, the line can state that the current variation is a pseudo-op of a more complex instruction. Last comes the reference number for the instruction in the encoding tables.
The description section is an English explanation of the parameters and any complement values, as well as the behavior of the instruction in specific cases.
Figure 4.1: ADD instruction's description (from [IA64-v3] p. 2-3)
    add
    Add
    Format:        (qp) add r1 = r2, r3         register_form               A1
                   (qp) add r1 = r2, r3, 1      plus1_form, register_form   A1
                   (qp) add r1 = imm, r3        pseudo-op
                   (qp) adds r1 = imm14, r3     imm14_form                  A4
                   (qp) addl r1 = imm22, r3     imm22_form                  A5
    Description:   The two source operands (and an optional constant 1) are added and the result placed in GR r1. In the register form the first operand is GR r2; in the imm_14 form the first operand is taken from the sign-extended imm14 encoding field; in the imm22_form the first operand is taken from the sign-extended imm22 encoding field. In the imm22_form, GR r3 can specify only GRs 0, 1, 2 and 3. The plus1_form is available only in the register_form (although the equivalent effect in the immediate forms can be achieved by adjusting the immediate). The immediate-form pseudo-op chooses the imm14_form or imm22_form based upon the size of the immediate operand and the value of r3.
    Operation:     if (PR[qp]) {
                       check_target_registe r(r1);
                       if (register_form)                    // register form
                           tmp_src = GR[r2];
                       else if (imm14_form)                  // 14-bit immediate form
                           tmp_src = sign_ext(imm14, 14);
                       else                                  // 22-bit immediate form
                           tmp_src = sign_ext(imm22, 22);
                       tmp_nat = (register_form ? GR[r2].nat : 0);
                       if (plus1_form)
                           GR[r1] = tmp_src + GR[r3] + 1;
                       elseGR[r1] = tmp_src + GR[r3];
                       GR[r1].nat = tmp_nat || GR[r3].nat;
                   }
    Interruptions: Illegal Operation fault
    IA-64 Instruction Reference    2-3
Figure 4.2: ADD instruction's description transformed by xpdf
[Diagram; elements in reading order: IA-64 documentation in PDF format; pdftotext; IA-64 documentation in text format; perl script; comment files and pseudo-code files; pseudo-code parser; concatenation; C++ files; indent; final C++ files; human readable C++ files]
Figure 4.3: The documentation extraction process

The operation section contains the pseudo-code of the instruction. The format used for the pseudo-code is described later (see section 4.3).
The interruptions section lists all the non-trivial interruptions raised by the instruction.
The mapping section is a complementary description present for some floating-point instructions.
The fp exceptions section describes the faults and traps that can be raised by floating-point instructions.
The serialization section describes if and how serialization is needed to complete the results of the instruction.
Figure 4.3 presents the whole extraction process. We'll describe it step by step. To parse the text file obtained after conversion by xpdf, a perl script was written with the following functions:
It reads through the whole text file containing all the instructions, and extracts two files for each real instruction: a code file and a comment file. Pseudo-instructions are removed. All the documentation is gathered as a C/C++ comment.
It parses the format section to guess the arguments that will be used in the pseudo-code and to identify pseudo-instructions. It also performs some guessing work on the type of these arguments according to their names.
It performs some simple search-and-replace transformations in the pseudo-code and removes all comments.
For each instruction, it wraps the pseudo-code into a function, where all the arguments found in the format section are considered as function arguments.
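The wrapping step listed above can be sketched as follows. The original tool is a perl script whose source is not reproduced here; this Python version only illustrates the idea, and the type-guessing rules (names ending in `_form` are booleans, names starting with `imm` are 64-bit integers, everything else is an `int`) are assumptions inferred from figure 4.4.

```python
def guess_type(name):
    """Guess the C++ type of an argument from its name."""
    if name.endswith("_form"):
        return "bool"
    if name.startswith("imm"):
        return "int_64"
    return "int"


def wrap_pseudo_code(mnemonic, args, pseudo_code):
    """Wrap the pseudo-code of one instruction into a C++ function."""
    # Assumption: dots in a mnemonic are replaced to get a valid identifier.
    name = "ia64_" + mnemonic.replace(".", "_")
    params = ", ".join("%s %s" % (guess_type(a), a) for a in args)
    body = "\n".join("    " + line for line in pseudo_code.splitlines())
    return "void %s(%s)\n{\n%s\n}\n" % (name, params, body)
```

Calling `wrap_pseudo_code("add", ["qp", "r1", "imm14", "plus1_form"], code)` yields a function whose signature has the same shape as the one shown in figure 4.4.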
    void ia64_add(
    int qp, int r1, int r2, int r3, int_64 imm14, int_64 imm22, bool imm14_form, bool imm22_form, bool plus1_form, bool register_form)
    {
    if (PR[qp]) {
    check_target_registe r(r1);
    if (register_form)
    tmp_src = GR[r2];
    else if (imm14_form)
    tmp_src = sign_ext(imm14, 14);
    else
    tmp_src = sign_ext(imm22, 22);
    tmp_nat = (register_form ? GR[r2].nat : 0);
    if (plus1_form)
    GR[r1] = tmp_src + GR[r3] + 1;
    elseGR[r1] = tmp_src + GR[r3];
    GR[r1].nat = tmp_nat || GR[r3].nat;
    }
    }

    /* Instruction: add
     *
     * Add
     * (qp) add r1 = r2, r3        register_form        A1
     * ...
     * Illegal Operation fault
     */
Figure 4.4: ADD instruction's description transformed by the perl script
    #include "framework.h"
    #include "pseudo-functions.h"
    #include "ia64-functions.h"

    /* Instruction: add
     *
     * Add
     * (qp) add r1 = r2, r3 ...
     * Illegal Operation fault
     */
    void ia64_add (int qp, int r1, int r2, int r3, int_64 imm14, int_64 imm22,
                   bool imm14_form, bool plus1_form, bool register_form)
    {
        notype tmp_src;
        notype tmp_nat;
        notype elseGR;
        if (PR[qp]) {
            check_target_register (r1);
            if (register_form)
                tmp_src = GR[r2];
            else if (imm14_form)
                tmp_src = sign_ext (imm14, 14);
            else
                tmp_src = sign_ext (imm22, 22);
            tmp_nat = (register_form ? GR[r2].nat : 0);
            if (plus1_form)
                GR[r1] = tmp_src + GR[r3] + 1;
            elseGR[r1] = tmp_src + GR[r3];
            GR[r1].nat = tmp_nat || GR[r3].nat;
        }
    }

Figure 4.5: ADD instruction's description at the end of the extraction process
Figure 4.4 illustrates the results of parsing the add instruction. This parsing method is rather efficient but fails in some cases. When the text information is extracted, the italic style is lost and this leads to confusing errors. In the format section, the arguments are written in italic whereas the name of the instruction is written in roman. This sometimes makes the parser find an argument that is not valid. For example, in parsing mov.ret.mwh.ih b1 = r2, tag13, ret is incorrectly understood as an argument. Some more complex arguments are difficult to parse as well, like itr.d dtr[r3] = r2. The parser tries to guess as many arguments as it can, and lets the pseudo-code parser handle them and remove them if necessary.
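The argument guessing described above can be illustrated with a short sketch. The original script is written in perl and is not reproduced in this report; the Python function below, its name and its regular expressions are assumptions made only to show why the loss of italics produces false arguments.

```python
import re

def guess_arguments(format_line):
    """Guess argument names from one line of a format section (illustrative only)."""
    args = []
    rest = format_line
    # A leading "(qp)" is the qualifying predicate argument.
    qp_match = re.match(r"\s*\((\w+)\)\s*", format_line)
    if qp_match:
        args.append(qp_match.group(1))
        rest = format_line[qp_match.end():]
    mnemonic, _, operands = rest.partition(" ")
    # Without italics, every component after the first dot looks like a
    # completer argument, even when it is a fixed part of the mnemonic.
    args.extend(mnemonic.split(".")[1:])
    # Operand names are the identifiers left once '=', ',', '[' and ']' are dropped.
    args.extend(re.findall(r"[A-Za-z_]\w*", operands))
    return args

print(guess_arguments("(qp) add r1 = r2, r3"))
print(guess_arguments("(qp) mov.ret.mwh.ih b1 = r2, tag13"))
```

In the second call, ret is reported as an argument, which is the kind of false positive that the pseudo-code parser later has to remove.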
4.3
Pseudo-code
The Intel pseudo-code describing the behavior of the IA-64 instructions is a slightly enhanced C. The following language constructs have been added:

{msb:lsb} and {bit} are post-fix operators used for selecting a range of bits. {msb:lsb} selects the bits between bit number lsb (where bit 0 is the least significant bit in the number) and msb. The {bit} variation selects only one bit.
u>, u>=, u<, u<=, u>>, u>>=, u+ and u* are like the usual operators but consider their operands as unsigned.
enumerations for arguments are written as quoted values whereas other enumerated values are classical uppercase symbols.
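As an illustration of what these added constructs mean, here is a minimal Python sketch of bit-range selection and of unsigned operators on 64-bit values. The function names are hypothetical; they are not the names used by the extraction parser or by the reference simulator.

```python
MASK64 = (1 << 64) - 1

def bit_range(value, msb, lsb):
    """Return bits msb..lsb of value, bit 0 being the least significant bit."""
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)

def bit(value, index):
    """Return the single bit at position index."""
    return bit_range(value, index, index)

def as_unsigned64(value):
    """Reinterpret a possibly negative integer as an unsigned 64-bit value."""
    return value & MASK64

def unsigned_less_than(a, b):
    """Equivalent of 'a u< b' on 64-bit operands."""
    return as_unsigned64(a) < as_unsigned64(b)

def unsigned_shift_right(a, count):
    """Equivalent of 'a u>> count': zeros are shifted in from the left."""
    return as_unsigned64(a) >> count
```

For example, bit_range(0xABCD, 15, 8) selects the upper byte 0xAB, and unsigned_less_than(-1, 1) is false because -1 is read as the largest 64-bit value.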
The Intel pseudo-code does not describe the complete operation of the instructions. When these operations can be coded in C in a simple way, the code is directly integrated. Otherwise the operations are delegated to pseudo-code functions which are described only in English in the manual. They often provide access to some part of the IA-64 processor structure (like the Register Stack Engine), perform low-level operations (floating-point computation) or contain the logic for exception checking (memory, floating-point, etc.). The pseudo-code is very dependent on them and thus most of them have to be implemented before instructions can be run on a simulator. They're described in more detail in chapter 5. In the add instruction example, check_target_register() and sign_ext() are pseudo-code functions that respectively check if the target register is valid (i.e. if it is inside the stack frame and not zero) and sign-extend a value from a given highest bit. A parser for this extended C dialect was built in C (with the help of lex and yacc). The parser transforms the code with the following goals:
it looks for the arguments that are actually used in the pseudo-code and removes from the function's definition those that are not useful.
it looks for unknown symbols and defines them as local variables. Some help is provided through a list of the symbols that will be defined externally (pseudo-code functions, register bank names, etc.).
it transforms the bit operations into function calls.
it translates the unsigned operators by casting the arguments to unsigned integers of 64 bits.
it transforms the quoted enumeration values into real constants that are defined next to the function.
The code of the functions is then passed through indent [5] and the documentation is added at the beginning of the file. The specified include directives are also added so that the code will be ready to compile. Figure 4.5 shows the contents of the file add.cc after the whole extraction process has completed. One can notice that the elseGR error has been parsed successfully and considered as a local variable. Human intervention is necessary to correct the problem.
Only two instructions generate errors during the process: frcpa and frsqrta. They define a helper function inside their pseudo-code, which is wrapped inside the function created by the perl script and thus generates a syntax error. It is enough to cut out this function to solve the problem.
The local variables are not typed automatically. Although it would be possible in most cases to guess this type from the context or at least from the variable name, the guess can be misleading and is not always accurate. Floating-point functions provide good examples of where complex typing is needed. Some functions use arrays of values whose type is not easily guessed from context. In some rare cases, some variables are in fact global variables (like slot in br, which refers to the current slot in the instruction bundle). Moreover, some pseudo-code functions are not described and are therefore not present in the list of known symbols. A human has to check each of the generated files to catch possible errors and perform the more complex typing.
The parser also generates a header file that contains the definitions of all the functions as well as the enumerations needed to make them compile. These enumeration values may need to be adapted since some of them are common to many functions (like memory access flags) while others are used in one instruction only.
4.4
Results
At the end of the extraction process, each real IA-64 instruction is encoded in a file containing the documentation and a function implementing the pseudo-code of the manual. The code is C/C++ compliant and can be compiled with the proper external definitions and the correct typing for the local variables and the function arguments. A header file has been created with the list of the function and enumeration definitions. The entire pseudo-code of the Intel documentation is thus available as usable C code. However, the encoding information could not be processed and needs to be provided by human coding.
Chapter 5
The reference simulator

The structure of the reference simulator was designed to use the pseudo-code extracted from the Intel documentation with as few changes as possible. Since the pseudo-code is mainly C, a C-like language was the most viable alternative. C++ was chosen for its capacity to redefine operators (and thus hide the complex operations that are not taken into account in the Intel pseudo-code) and for its inheritance system, which would make typing easier and allow more generic functions. The reference simulator is divided into several parts:
The framework of the simulator, which contains the state of the processor and the definitions needed so that the pseudo-code can run. It also implements the mechanisms (TLB, RSE, etc.) that are invisible to the programmer.
The pseudo-code extracted from the documentation, which is ready to be used by a decoder.
The pseudo-code functions that are needed for the pseudo-code to run, which provide a number of services and access points to the framework of the processor itself.
The binary instruction decoder, which calls the right functions with the relevant arguments.
This chapter will describe how the simulation of the different aspects of an IA-64 processor was implemented. It will also cover the framework needed to use the reference simulator as a reference for Simics. It provides, however, only an overview of the implementation choices and possibilities. Figure 5.3 presents a summary of the structure of the reference simulator with a size evaluation for each module in terms of lines of code.
5.1
Register files
    Register
        -value: int_64
    RegisterPart
        -value: &int_64 (from a register)
        -startBit: int
        -endBit: int
    Ignored_reg
        -value: int_64 = reading returns 0
    OneBitRegister
        -value: boolean
    Reserved_reg
        -value: int_64 = reading is not allowed
    FloatRegister
        +sign: int
        +exponent: int
        +significand: int_64
    BitControlledRegister
        -BitRegStatus[64]: enum{normal, reserved, ignored}
    RSC_reg
        +mode: RegisterPart
        +pl: RegisterPart
        +be: RegisterPart
        +loadrs: RegisterPart
    Other specific registers
    GeneralRegister
        +nat: OneBitRegister
    RegisterVector
        +operator[](index:int): T&
    GeneralRegisterVector
        +operator[](index:int): GeneralRegister& (as visible for the software)
        +getRSEReg(index:int): GeneralRegister& (without rotation)
        +getPhysReg(index:int): GeneralRegister& (physical register)
    FloatRegisterVector
        +operator[](index:int): FloatRegister& (as visible for the software)
        +getPhysReg(index:int): FloatRegister& (physical register)
    PredicateRegisterVector
        +operator[](index:int): PredicateRegister& (as visible for the software)
        +getPhysReg(index:int): PredicateRegister& (physical register)

Figure 5.1: Class hierarchy for register types

    instruction add_register_register(qp,r1,r2,r3)
        pattern aslot_type == 1 && opcode == 8 && a_x2a == 0 && a_ve == 0 && a_x4 == 0 && a_x2b == 0
        syntax "(p{d:qp})\t add r{d:r1} = r{d:r2}, r{d:r3}"
        semantics #{ ia64_add(qp,r1,r2,r3,0,0,false,false,true); #}

Figure 5.2: Example of simgen encoding for the instruction add reg, reg

A class hierarchy was defined to handle the different types of registers. Most of them are 64-bit wide registers, sometimes divided into bit fields. The Intel pseudo-code accesses registers either as a whole 64-bit value or as a structure of bit fields. It uses most of the assignment, comparison and computation operators on the registers or the fields as well. To handle these requirements, the following class hierarchy was defined (see also figure 5.1):
Register defines a 64-bit register with all the operators that can be applied to it (including a cast to a 64-bit integer).
Ignored_reg and Reserved_reg define two registers that respect Intel's definition of ignored and reserved registers for reading and writing values (an ignored register can be written to but always returns 0, whereas a reserved register cannot be accessed).
BitControlledRegister is a register where every bit has a status among the list: normal, ignored, reserved. An ignored bit is always 0; it can be written to, but does not change. A reserved bit is always 0 and raises an exception if a write tries to set it to 1. Reads and writes in BitControlledRegisters are done according to their bit tables.
RegisterPart contains a reference to a register value, as well as a bit field definition (start bit, end bit). When a RegisterPart is read, it extracts its bit field from the register value. When it is written to, it sets its bit field in the register to the given value.
By combining BitControlledRegister and RegisterPart, all specific registers can be described and accessed as a whole or as a structure.
OneBitRegister defines a 1-bit register for predicates and NaT bits.
FloatRegister defines a floating-point register of 82 bits.
GeneralRegister defines a 64-bit register with a NaT value.
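The combination of BitControlledRegister and RegisterPart can be sketched as follows. This is a simplified Python illustration, not the C++ code of the reference simulator: the per-bit status table is replaced here by two bit masks, and the method and exception names are assumptions.

```python
MASK64 = (1 << 64) - 1

class ReservedFieldError(Exception):
    """Raised when a write tries to set a reserved bit to 1."""

class Register:
    """Plain 64-bit register."""
    def __init__(self, value=0):
        self.value = value & MASK64

    def read(self):
        return self.value

    def write(self, value):
        self.value = value & MASK64

class BitControlledRegister(Register):
    """Register whose bits are normal, ignored or reserved."""
    def __init__(self, ignored_mask=0, reserved_mask=0):
        super().__init__(0)
        self.ignored_mask = ignored_mask
        self.reserved_mask = reserved_mask

    def write(self, value):
        if value & self.reserved_mask:
            raise ReservedFieldError(hex(value & self.reserved_mask))
        # Ignored and reserved bits always stay 0.
        self.value = value & MASK64 & ~(self.ignored_mask | self.reserved_mask)

class RegisterPart:
    """View on the bit field start_bit..end_bit of another register."""
    def __init__(self, register, start_bit, end_bit):
        self.register = register
        self.start_bit = start_bit
        self.width = end_bit - start_bit + 1

    def read(self):
        return (self.register.read() >> self.start_bit) & ((1 << self.width) - 1)

    def write(self, field_value):
        mask = ((1 << self.width) - 1) << self.start_bit
        merged = (self.register.read() & ~mask) | ((field_value << self.start_bit) & mask)
        # Going through the register's own write keeps the bit rules enforced.
        self.register.write(merged)
```

Because RegisterPart writes through the underlying register, a field access is subject to the same ignored and reserved rules as a whole-register access.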
The register banks are defined as a fixed-size template vector called RegisterVector. Three specializations handle rotating registers and the Register Stack Engine operations: GeneralRegisterVector, FloatRegisterVector and PredicateRegisterVector. All of these classes redefine the operator [] so that the pseudo-code can transparently access the right registers while ignoring the rotating and renaming operations happening in the background. In the Intel pseudo-code, some registers are accessed as a structure despite the fact that they are referenced from a vector (like AR[RSC].be). To handle this properly, the pseudo-code has been transformed into a reference to a register (RSC_real.be). The names of the registers have been defined as enumerations to be used directly as a vector index (AR[RSC]), and the vector contains a reference to the XXX_real registers at the appropriate index. The entire set of registers of the IA-64 is described in one file. These registers are defined globally to be consistent with the pseudo-code usage. Since they're global, they have been defined so that initialization is performed separately from their constructor. This allows the setup code to set some implementation-dependent variables before creating the IA-64 framework.
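The way a redefined index operator can hide register rotation may be sketched in Python as below. The class is a simplified assumption: it models only a rotating region renamed through a register rename base (rrb), and leaves out the register stack handled by the real GeneralRegisterVector. The class and method names are hypothetical.

```python
class RotatingRegisterVector:
    """Fixed-size register bank with a rotating region (illustrative only)."""

    def __init__(self, size, rotating_start, rotating_size):
        self.physical = [0] * size
        self.rotating_start = rotating_start
        self.rotating_size = rotating_size
        self.rrb = 0  # register rename base, changed by rotating instructions

    def _physical_index(self, index):
        first = self.rotating_start
        if first <= index < first + self.rotating_size:
            # Registers in the rotating region are renamed modulo its size.
            return first + (index - first + self.rrb) % self.rotating_size
        return index

    def __getitem__(self, index):
        """Register as visible to the software."""
        return self.physical[self._physical_index(index)]

    def __setitem__(self, index, value):
        self.physical[self._physical_index(index)] = value

    def get_phys_reg(self, index):
        """Physical register, without renaming."""
        return self.physical[index]

# Predicate-like bank: 64 registers, the last 48 rotating.
pr = RotatingRegisterVector(64, 16, 48)
pr[16] = 1
pr.rrb = 47  # one rotation step: the base is decremented modulo 48
print(pr[17], pr[16], pr.get_phys_reg(16))
```

After one rotation step the value written through index 16 is seen through index 17, while the physical register itself has not moved.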
5.2
Instruction decoding
[Figure 5.3 (diagram): the IA-64 Documentation feeds, directly or through the Extraction process, the modules of the Reference Simulator:
Register framework (1500 lines)
Pseudo-code functions (4750 lines)
Internal processor system: TLB, interrupts, execution (1670 lines)
Softfloat
Memory device (200 lines)
Pseudo-code (10070 lines)
Decoder (3120 lines)
State file manipulation (800 lines)]

Figure 5.3: The reference simulator. The line counts refer to the number of code lines in each module. A solid arrow shows that the respective module has been extracted from the documentation, whereas a dashed arrow emphasizes that the module was based on an English description.

The instruction decoding was written with simgen, an internal tool developed at Virtutech. It was kept as simple and straightforward as possible and thus uses only a few of the available features. The list of the bit fields used is completely defined, then each encoding is described as simply as possible. Instructions are grouped only when it is obviously clearer to do so. The only operation performed by a decoded instruction is to call the corresponding pseudo-code function with the appropriate arguments. Some instructions perform checks on their arguments and raise exceptions if needed.
An example of encoding is given in figure 5.2. The instruction add_register_register is defined with the parameters qp, r1, r2 and r3. Those are previously defined bit fields that will be extracted from the binary instruction. The bit pattern to recognize the instruction is defined, then a human-readable syntax is provided, so that the decoder can print out the instruction it is decoding. Finally the semantics part contains the code that will be executed if this instruction is decoded, which is a simple call to the generated function ia64_add with the right parameters set. The last arguments of the function are the boolean flags that control the execution (here register_form is true while imm14_form and plus1_form are false).
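The pattern, fields and semantics of figure 5.2 can be mimicked by a small table-driven decoder. The sketch below is written in Python rather than simgen, whose syntax and generated code are not described here. The bit positions of the fields are assumptions based on the A1 instruction format, and the slot type test is left out.

```python
# Bit fields of a 41-bit instruction slot: name -> (start bit, length).
# These positions are assumed for illustration.
FIELDS = {
    "qp": (0, 6), "r1": (6, 7), "r2": (13, 7), "r3": (20, 7),
    "a_x2b": (27, 2), "a_x4": (29, 4), "a_ve": (33, 1),
    "a_x2a": (34, 2), "opcode": (37, 4),
}

def extract(slot, name):
    """Extract the named bit field from the instruction slot."""
    start, length = FIELDS[name]
    return (slot >> start) & ((1 << length) - 1)

# Each entry: (instruction name, required field values, semantics).
# The semantics only builds the call to the pseudo-code function.
PATTERNS = [
    ("add_register_register",
     {"opcode": 8, "a_x2a": 0, "a_ve": 0, "a_x4": 0, "a_x2b": 0},
     lambda f: ("ia64_add",
                (f("qp"), f("r1"), f("r2"), f("r3"), 0, 0, False, False, True))),
]

def decode(slot):
    """Return (instruction name, pseudo-code call) or None if nothing matches."""
    for name, pattern, semantics in PATTERNS:
        if all(extract(slot, key) == value for key, value in pattern.items()):
            return name, semantics(lambda field_name: extract(slot, field_name))
    return None

# add r1 = r2, r3 with qualifying predicate p0
slot = (8 << 37) | (3 << 20) | (2 << 13) | (1 << 6)
print(decode(slot))
```

Keeping the semantics reduced to one call, as in the simgen description, means that the decoder holds no instruction logic of its own.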
5.3
Exception handling
Exceptions are raised by the raiseInterruption() function. If the exception is to be deferred, it is simply stored for reference, in order to check whether other exceptions will be precluded¹. If the exception is raised, all registers that can be set at this point are given their value (some registers that give more information about the current state, the registers that should be saved, etc.). Some specific registers are considered as already set by the function that may raise this exception (this is mostly the case for memory accesses, which set registers with the faulting address or the VHPT information). To handle deferred exceptions correctly, these functions set shadow registers that are only copied to the real registers if an exception is raised. When an interruption has been raised and the registers are correctly set, a C++ exception is thrown to prevent any further execution of the current instruction. This exception is to be caught by the execution unit. Traps are handled in a similar way, except that no C++ exception is thrown since the traps are raised only by the execution unit.
The setting of the registers after an exception is not completely defined in the IA-64 architecture: some register settings are implementation-dependent and thus can have any value. This is handled with a boolean value added to every register that indicates whether the current value is implementation-dependent or not, and whether it should be included in the comparisons when using the reference simulator.

¹ In the IA-64 model, an exception can be deferred and simply ignored. However, some subsequent exceptions can be deferred automatically even if the system does not ask the processor to do so, because a previous, related and more important exception has been deferred. They are said to be precluded by the first deferred exception.
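The interplay between shadow registers, deferral and the thrown exception can be sketched as follows. This is a minimal Python analogue under stated assumptions: the class, the register dictionary and the example instructions are hypothetical, and only the name raiseInterruption (written raise_interruption here) comes from the text.

```python
class Interruption(Exception):
    """Aborts the current instruction once the interruption state is set."""
    def __init__(self, name):
        super().__init__(name)
        self.name = name

class Processor:
    def __init__(self):
        self.registers = {"IFA": 0}
        self.shadow = {}    # values prepared by the faulting function
        self.deferred = []  # deferred exceptions kept for later checks

    def raise_interruption(self, name, deferred=False):
        if deferred:
            # A deferred exception is only recorded: real registers stay untouched.
            self.deferred.append(name)
            return
        # Shadow values reach the real registers only when the exception is raised.
        self.registers.update(self.shadow)
        self.shadow.clear()
        raise Interruption(name)

    def execute(self, instruction):
        """Execution unit: runs one instruction and catches its interruption."""
        self.shadow.clear()
        try:
            instruction(self)
            return None
        except Interruption as interruption:
            return interruption.name

def faulting_load(cpu):
    cpu.shadow["IFA"] = 0x1000                  # faulting address
    cpu.raise_interruption("Data TLB Miss fault")
    cpu.registers["GR1"] = 42                   # never reached

def speculative_load(cpu):
    cpu.shadow["IFA"] = 0x2000
    cpu.raise_interruption("Data TLB Miss fault", deferred=True)
```

Running faulting_load through execute() leaves the faulting address in IFA and skips the rest of the instruction, whereas speculative_load completes and leaves IFA unchanged.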
5.4
Memory simulation
Memory is simulated in a very simple manner: it is allocated when needed in blocks of fixed size (4 KB by default), physical addresses can be 64 bits wide, and read and write operations are independent of the host's endianness. Non-allocated addresses always return 0. Values from 1 to 16 bytes can be read or written in one call.
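A memory with these properties can be sketched in a few lines of Python. The class below is an illustration only, not the memory device of the reference simulator; its name, its interface and the big_endian parameter are assumptions.

```python
class SparseMemory:
    """Memory allocated on demand in fixed-size blocks (illustrative only)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # block number -> bytearray of block_size bytes

    def read(self, address, size, big_endian=False):
        if not 1 <= size <= 16:
            raise ValueError("size must be between 1 and 16 bytes")
        data = bytearray(size)
        for offset in range(size):
            block, index = divmod(address + offset, self.block_size)
            if block in self.blocks:
                data[offset] = self.blocks[block][index]
            # Bytes of non-allocated blocks keep their default value 0.
        return int.from_bytes(data, "big" if big_endian else "little")

    def write(self, address, size, value, big_endian=False):
        if not 1 <= size <= 16:
            raise ValueError("size must be between 1 and 16 bytes")
        data = value.to_bytes(size, "big" if big_endian else "little")
        for offset, byte in enumerate(data):
            block, index = divmod(address + offset, self.block_size)
            if block not in self.blocks:
                self.blocks[block] = bytearray(self.block_size)
            self.blocks[block][index] = byte
```

Values are converted byte by byte with an explicit byte order, so the result does not depend on the endianness of the host, and an access may span two blocks.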
5.5
Register Stack Engine
The Register Stack Engine is implemented exactly as it is described in [IA64-v2] in chapter 6 (and briefly explained in section 3.6). The same state values are kept and updated. The RSE implements only the lazy mode, that is, it never performs loads or stores in advance but only when needed. The code that updates the RSE is in the pseudo-code functions.
5.6
Translation Look-aside Buffers
The TLBs are defined from a versatile vector structure that can work as a fixed-size array or as an infinite-size list. Fixed-size array mode is used for the TLB registers, which are accessed by index. Infinite-size list mode is used for the TLB caches: the reference simulator needs to keep complete track of the TLB operations, so that it can cope with an implementation that is slightly different from its own. Indeed, all the algorithms to update the TLB caches are left to the implementation, provided that they follow some basic rules to keep the instruction flow running. The reference simulator needs to be able to handle a TLB miss that would have been a hit in its own system, or to handle purges larger than what was required by software. At the same time it must be able to ensure the validity of a TLB hit and not blindly follow what the other implementation does. This should be done by keeping all page translations until a miss shows that they have been purged by the other implementation².

² The reference simulator is working this way. However, since no TLB related events are collected yet, it never removes pages from its cache except during an explicit purge.
The TLB arrays also contain the basic code to match and purge pages according to the rules defined by the IA-64 specifications.
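The two modes of the vector structure can be illustrated with the following Python sketch. It is a deliberately reduced assumption: entries are plain dictionaries with a start address, a size and a physical address, and region identifiers, access rights and the other matching rules of the IA-64 specifications are ignored.

```python
class TranslationVector:
    """Fixed-size array (translation registers) or unbounded list (cache)."""

    def __init__(self, capacity=None):
        self.capacity = capacity
        self.entries = [None] * capacity if capacity is not None else []

    def insert(self, entry, index=None):
        if self.capacity is None:
            self.entries.append(entry)    # list mode: nothing is ever evicted
        else:
            self.entries[index] = entry   # array mode: access by index

    def lookup(self, virtual_address):
        """Return the entry covering virtual_address, or None on a miss."""
        for entry in self.entries:
            if entry and entry["start"] <= virtual_address < entry["start"] + entry["size"]:
                return entry
        return None

    def purge(self, start, size):
        """Remove every entry overlapping the range [start, start + size)."""
        def overlaps(entry):
            return entry["start"] < start + size and start < entry["start"] + entry["size"]
        if self.capacity is None:
            self.entries = [e for e in self.entries if not overlaps(e)]
        else:
            self.entries = [None if e and overlaps(e) else e for e in self.entries]
```

In list mode a translation disappears only through purge(), which matches the idea of keeping every page until the other implementation shows that it has been dropped.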
5.7
Floating-point computation
Floating-point computation is done with an extension of the library softfloat [6]. This library performs IEEE 754 [IEEE-std] floating-point computations with numbers of size 32, 64, 80 and 128 bits. It was extended to handle 82-bit numbers as defined in the IA-64 specifications:
the type floatx82 was defined.
the addition, subtraction, multiplication and rounding functions were adapted to 82-bit wide floating-point numbers. The rounding was extended so that it could be performed with a precision of 24, 53 or 64 bits in the significand, and 7, 11, 15 or 17 bits in the exponent.
the fused multiply-add operation³ was introduced along with the special type floatx82_infp, which is designed to keep enough precision during the multiplication and the addition so that they can be considered as infinite precision operations. The rounding is done after they have been performed.
the fpa floating-point information has been added to the rounding code. It signals the fact that the magnitude of the delivered result is greater than the magnitude of the infinite precision result before rounding. Since it is not part of the IEEE standard, it does not deliver a trap by itself.
the floating-point standard comparisons have been ported as well.
The pseudo-code functions that actually perform the computation are based on these extensions of softfloat.
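The benefit of rounding only once after the multiplication and the addition can be shown with exact rational arithmetic. The Python sketch below is not the softfloat extension: it uses fractions.Fraction in place of floatx82_infp, handles only round-to-nearest-even, and ignores the exponent range. All function names are hypothetical.

```python
from fractions import Fraction

def round_to_precision(x, precision):
    """Round x to `precision` significant bits, ties to even, unbounded exponent."""
    x = Fraction(x)
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    scaled = abs(x)
    exponent = 0
    # Scale until 2**(precision - 1) <= scaled < 2**precision.
    while scaled >= 2 ** precision:
        scaled /= 2
        exponent += 1
    while scaled < 2 ** (precision - 1):
        scaled *= 2
        exponent -= 1
    # round() on a Fraction rounds to the nearest integer, ties to even.
    significand = round(scaled)
    return sign * Fraction(significand) * Fraction(2) ** exponent

def fused_multiply_add(a, b, c, precision):
    """a * b + c computed exactly, then rounded once."""
    return round_to_precision(Fraction(a) * Fraction(b) + Fraction(c), precision)

def separate_multiply_add(a, b, c, precision):
    """Same operation with an intermediate rounding of the product."""
    product = round_to_precision(Fraction(a) * Fraction(b), precision)
    return round_to_precision(product + Fraction(c), precision)

a = b = Fraction(9, 8)
c = Fraction(-5, 4)
print(fused_multiply_add(a, b, c, 4), separate_multiply_add(a, b, c, 4))
```

With a 4-bit significand, the product 81/64 is rounded to 5/4 in the separate version, so the final result is 0, while the fused version returns the exact value 1/64.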
5.8
Pseudo-code functions
The pseudo-code functions perform a large number of tasks so that the pseudo-code itself is not cluttered with complex computations and implements only the top-level logic of the instruction. They can be classified in categories:

Generic help functions (concatenation, sign/zero extending, shift, etc.).
Processor generic functions (template information, register check, ignored and reserved check, etc.).
TLB related functions (TLB searching, exception check, etc.).
Physical memory functions (alignment, read, write, semaphore, cache, etc.).
RSE related functions.
Exceptions raising.
Floating-point exception checking.
Floating-point computations.
Floating-point memory operations.
Implementation dependent algorithms (TLB insert, VHPT tags, etc.).

³ This instruction performs the operation a × b + c in infinite precision and rounds the final result. Variants are a × b - c, -a × b + c, etc.

5.9
Implementation dependent features
In the framework and the pseudo-code functions, functions and values that are not described in the IA-64 manual (and are thus implementation-dependent) have been separated, so it is relatively easy to modify the reference simulator to model something other than an Itanium processor. This includes the numbers of registers in certain banks, the size of some fields, the size of the virtual and physical address spaces, TLB algorithms, exception checking, etc. As a note, it is important to consider that the Itanium internal features are not completely available to the public, and thus the reference simulator is a generic IA-64 simulator that can be customized to match as closely as possible a specific implementation (currently the Itanium). When some features are not described precisely enough, the reference simulator should try to match a superset of the possibilities and adapt its behavior according to the other processor (the TLBs are a good example). This is not completely implemented in the current reference simulator but the framework has been designed with this goal in mind.
5.10
State saving and loading
To act as a comparison reference machine, the reference simulator needs to be able to load and save the state of the processor and gather information about what happens during the execution. A state file format was defined: it contains a start state with the registers and memory values that are needed, information about how many instructions to run, whether interruptions happened at specific instructions, whether memory accesses were done, etc., and finally the end state. The whole format is described in appendix A. The IA-64 framework contains code to save and restore its state, as well as a parser to read the state files and execute them. It keeps a list of the events happening during execution. It also contains a simple comparator that checks if the final state of the reference simulator is identical to the end state in the state file.
At this time, the state files and the comparison are not complete. General and predicate registers are handled correctly, as well as some other specific registers (IP, CFM, etc.). Floating-point registers are ignored, memory accesses are ignored as well, and TLB related events are not collected. This is due to two major reasons: the IA-64 Simics simulator is not complete, so the IA-64 structure is not completely available to the user and most of the function hooks to gather events are not handled yet; and the way the state file and the events are handled depends on the tests that are performed, and no test generator has been designed yet, except some simple examples that are described in the next chapter. However, the framework to perform a complete comparison and handle more events is in place and should be easy to extend.
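The comparison between the end state of a state file and the final state of the reference simulator can be sketched as follows. The sketch works on plain Python dictionaries, which is an assumption: it does not read the binary format of appendix A, and the function and parameter names are hypothetical.

```python
def compare_states(expected, actual, implementation_dependent=()):
    """Return the list of (register, expected value, actual value) that differ.

    Registers listed in implementation_dependent can hold any value and are
    therefore left out of the comparison.
    """
    differences = []
    for name in sorted(expected):
        if name in implementation_dependent:
            continue
        if actual.get(name) != expected[name]:
            differences.append((name, expected[name], actual.get(name)))
    return differences

end_state = {"GR4": 9, "GR5": 5, "IP": 0x10, "ISR": 0x123}
final_state = {"GR4": 8, "GR5": 5, "IP": 0x10, "ISR": 0}
print(compare_states(end_state, final_state, implementation_dependent={"ISR"}))
```

Here only GR4 is reported, since ISR was declared implementation-dependent for this run.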
Chapter 6
Simics module
To be able to compare Simics execution to the reference simulator, we need to collect state information from Simics, which means we need a module able to save and restore state files. Since Simics can be entirely scripted in Python, Python was an obvious choice to quickly develop this module, and to create a simple test generator to validate the comparison system.
6.1
State files
Two classes were written in Python to handle the state files: one is generic (stateFile) and the other, specialized for the IA-64, is derived from the first (ia64StateFile). They provide functions to write a complete state file (beginning and end states, events) and to read the start state from a state file. They also provide functions to print a generic state file on the screen. With these two classes, two Python scripts are provided:
print-state-file.py is a stand-alone script to print a generic state file. It is far from perfect since it doesn't interpret any of the implementation-specific fields, but it should be able to print any file following the standard.
rerun-test.py provides a loadTest() function to be used inside Simics to load the start state from a state file. Simics is then ready to repeat the test.
6.2
A test generator
A simple test generator was written in Python. Its goal is to generate random instruction patterns to test whether the decoding and execution of these instructions are done correctly. If we consider the binary instruction as composed of encoding bits (they define which instruction will be decoded) and parameter bits (they define which registers or immediate values will be used), it is possible to loop over the whole encoding space to test all the instructions, and to set the parameters to interesting values. This is what is done by the test generator.
For each test, the script sets up a bundle where instructions 0 and 2 are no-operations, and instruction 1 is the instruction to test. The second position was chosen because it is the only slot that accepts all the instruction types defined by the standard (I, M, B, F; X being rather special). It is possible to specify encoding bits in the binary encoded instruction for which all bit combinations will be tried, that is, the test generator is going to test all the possible ways to fill these bit fields. To avoid a combinatorial explosion, these bit fields should not exceed a certain size. It is also possible to specify bit fields that will take some specific values (i.e. parameters).

[Figure 6.1 (diagram): inside Simics, the Simics module generates the instruction to run, generates a random state, runs the instruction and writes the State file. The Reference simulator reads the state file, executes the instruction, compares the results and, if needed, writes an Error report.]

Figure 6.1: The test process.

The test generator will then generate a test for each encoding it was asked to test, produce a random state for the processor, run the test in Simics and create a state file containing as much information as possible. This state file is directly read by the reference simulator, which performs the test and prints out all the differences found. This is actually done through a named pipe where the Python script writes and the reference simulator reads. If a difference is found during or at the end of the execution, the corresponding state file is then saved for further analysis. Including the file test-instruction-range.py provides a function called testInstructionRange() which takes the following arguments:

The file or named pipe used to write output.
The instruction type for the slot 1 (I=0, M=1, F=2, B=3).
The list of bits that will be entirely tested [(start bit, length), ...].
The start value for these bits. This value is used to fill progressively all the bit fields defined above (it must be a Python long integer).
The end value. The start and end values control the length of the test since all the values in between will be used.
The list of bits that take specific values (arguments) [(start bit, length), ...].
A list of values to put into the bit areas defined above [v1, v2, ...] (they must be Python long integers). The current version of the test generator tries all the possible combinations of these values with the defined arguments, which can lead to huge tests.
A list of three values that are used as seeds in the random generator. These values are provided so it is possible to run the test several times in exactly the same way.
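How the encoding bits and the argument values combine into instruction words can be sketched as below. This is not the code of test-instruction-range.py: it is a Python 3 illustration (so no long-integer suffix is needed), the function names are hypothetical, and the order in which the counter fills the bit fields is an assumption.

```python
from itertools import product

def scatter(value, fields):
    """Distribute the low bits of value over fields [(start bit, length), ...]."""
    word = 0
    for start, length in fields:
        word |= (value & ((1 << length) - 1)) << start
        value >>= length
    return word

def generate_encodings(encoding_fields, start, end, argument_fields, argument_values):
    """Yield one instruction word per encoding value and argument combination."""
    for counter in range(start, end + 1):
        base = scatter(counter, encoding_fields)
        # Every argument field takes every listed value.
        for values in product(argument_values, repeat=len(argument_fields)):
            word = base
            for (field_start, length), value in zip(argument_fields, values):
                word |= (value & ((1 << length) - 1)) << field_start
            yield word

# 15 encoding bits and three arguments taking two values each
count = sum(1 for _ in generate_encodings([(13, 1), (27, 14)], 0, 0x7FFF,
                                          [(6, 7), (13, 7), (20, 7)], [4, 5]))
print(count)
```

The number of generated words is the number of encoding values multiplied by the number of argument combinations, which is why long argument lists quickly lead to huge tests.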
6.3
Results
The test generator has been run on the entire I-instruction encoding with a random state and valid parameters. Here is a transcript of the whole test. First we create a named pipe in a temporary directory, then we run Simics and load the configuration for an IA-64 machine with a memory device.
bash> mkfifo /tmp/npipe
bash> simics-ia64-itanium
simics> read-configuration gurra.conf
simics> source test-instruction-range.py
We run the reference simulator on the named pipe and we let it wait for Simics to begin generating tests. The -sf flag asks the reference simulator to save state files on error.
bash> simics-compare /tmp/npipe -sf
The test is then launched on Simics.

simics> @testInstructionRange("/tmp/npipe", 0, [(13,1), (27,14)], 0L, 0x7FFFL, [(6,7), (13,7), (20,7)], [4L, 5L], [1,1,1])
The arguments are the following: the output is written to /tmp/npipe; the bit encoding areas are 1 bit at position 13 and the last 14 bits of the instruction. Since all of these bits are covered, the test goes from 0 to 0x7FFF, or 2^15 - 1. The arguments are the three register indices that can be encoded; they take either the value 4 or 5, and all possibilities are tested (2^3 = 8). We will thus generate 8 × 2^15 ≈ 262 000 tests. From these the reference simulator produces 35 000 error files, which are reduced to 100 after a small perl script has parsed the files to remove redundant instructions. Among the errors detected, here are some representative examples:
mov.i ar4 = 4 raised an exception on the reference simulator but not on Simics. These application registers are not available when the instruction is in an I-slot, and Simics did not check this.
zxt1 r4 = r4 was not decoded by Simics because it is not implemented yet.
mov pr.rot = immediate did not set the predicate registers correctly in Simics.
It is important to note that some errors are generated by the reference simulator itself, as it contains bugs of its own. Hopefully these bugs do not extend to the Intel pseudo-code, which stays the reference for the functioning of the instructions. Several problems arise with this type of test:
The number of errors is quite impressive, since the test can become highly redundant. A script to sort things out is definitely very helpful. However, it can't be too smart without knowing a lot about the instruction set. The solution used here is to reduce errors to one per instruction. It is possible that the script will skip some errors, but they will hopefully be caught in the next test, when the previous errors are corrected.
This test is pretty straightforward for arithmetic instructions, but can become difficult to set up and check when memory accesses are involved: randomly generated addresses tend to fall outside the address space. Moreover, it is not very smart to transfer the whole memory to compare its state before and after the instruction. A new scheme is needed where memory accesses are trapped and directed by the test generator and memory values are generated and stored on the fly.
Despite some problems, this test shows quite nicely what can be done with the reference simulator in terms of regression testing. A series of similar tests running all the time would probably catch most of the new errors that might be introduced in the Simics simulator. The test itself took 7 hours to run on a Pentium III 900MHz, and most of this time was used by the very slow and unoptimized Python test generator code. The reference simulator itself was running only 10% of the time. As an estimate of the quality of the results, 15 Simics related bugs (not counting unimplemented instructions) were discovered during the test, while 2 bugs in the reference simulator produced false error reports. A second run some days later on a more recent version of Simics discovered bugs in newly implemented instructions and validated some of the previous bug corrections.
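The reduction of the error reports to one per instruction can be sketched as follows. The original reduction was done by a small perl script working on the saved files; the Python function below, its name and the dictionary layout of a report are assumptions made for illustration.

```python
def reduce_errors(reports):
    """Keep only the first error report for each instruction mnemonic."""
    first_by_mnemonic = {}
    for report in reports:
        tokens = report["disassembly"].split()
        # Skip a leading qualifying predicate such as "(p5)".
        if tokens[0].startswith("("):
            tokens = tokens[1:]
        first_by_mnemonic.setdefault(tokens[0], report)
    return list(first_by_mnemonic.values())

reports = [
    {"disassembly": "mov.i ar4 = 4", "file": "error-0001"},
    {"disassembly": "mov.i ar5 = 4", "file": "error-0002"},
    {"disassembly": "(p5) zxt1 r4 = r4", "file": "error-0003"},
]
print([report["file"] for report in reduce_errors(reports)])
```

As noted above, such a reduction can hide a second, different error on the same instruction until the first one has been corrected.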
Chapter 7
Future work
There are several areas that deserve attention if the IA-64 reference simulator is to be further developed.

The simulator was developed in about four months. It is not complete (parallel oating-point instructions are not implemented) and neither is it a fully featured IA-64 simulator (some functionalities were kept to the minimum level imposed by the documentation). A better VHTP simulation is denitely needed to complete the memory management unit. To provide realistic prediction handling, the Advanced Load Address Table should be implemented. These two systems increase the need for the reference simulator to handle implementations different from its own in the comparison process, since they introduce a lot of implementation-dependent features. More testing is needed, above all for the virtual memory system, the RSE engine and the oating-point error settings. In the extraction process, typing variables by guessing could be extended and rened to simplify the work of the programmer. It seems difcult at this time to implement a better typing process due to the lack of information available to the parser. The simple test generator could be extended for all types of instructions by including memory transactions in the state le. This is not as simple as it may seem since memory handling includes checking if the written or read values are correct, but also if no other values have been changed. Obviously one does not want to transmit whole memory snapshots in the state le, so some intermediate scheme is needed. The whole testing process needs to be automated even more. The error extraction has to be more user-friendly and provide more options. The error reports should be codied properly so they could be parsed automatically, sorted or searched. The test generation could be done more intelligently by providing more information on the processor structure and testing more specic points than random instructions (for example by parsing the simgen le to get instruction encoding information).
Chapter 8
Conclusion
The goal of this thesis was to provide a new method to test Simics at the instruction-set level. A reference simulator for the IA-64 Itanium was implemented with the pseudo-code extracted from the Intel documentation. It includes a test framework that checks the correctness of Simics by comparing its execution of instructions with that of the reference simulator. A simple test generator was developed to validate the method and gave good results.

The quality of the extraction process for the pseudo-code is good, and the process is almost completely automated. It would take only a few days to process a new version of the documentation and generate an up-to-date reference simulator, provided that the format of the pseudo-code does not change too much. In the same way, implementing an IA-64 processor different from the Itanium requires only a few changes in reasonably well delimited areas. The extracted code amounts to 11000 lines, around 50% of the total code of the reference simulator. Measured against the few weeks spent developing the extraction process and the greater correctness achieved through it, this confirms that a documentation-based simulator is a good choice for a reference machine.

The comparison method has been shown to be useful and to produce interesting and exploitable results. In fact the reference simulator is currently used in two different ways: as a reference decoder when Simics does not decode a specific instruction or decodes it incorrectly, and as a reference simulator during the tests. It is also intended to serve as a regression testing tool once Simics IA-64 has reached a sufficient level of maturity.

This master's thesis has validated the testing model that was proposed, which can be extended and used with better test generators during Simics IA-64 development to reduce the implementation time and improve the quality of the simulator.
Appendix A
State file format
The following text describes the format of the file that saves the state of a processor before and after the execution of a sequence of instructions, together with the events occurring during the sequence and any other information that may be relevant. The file is encoded in binary format. It contains several sections, each identified by a specific 32-bit section number (whose MSB is 1). The generic format of the file is the following:

- 4 bytes for the section number.
- For each field in the corresponding section:
  - 4 bytes for the identifier (whose MSB is 0).
  - 4 bytes for the binary length of the field in bytes (excluding the 8 mandatory bytes).
  - The rest of the information is field dependent.
Since each field encodes its length, an unknown field can be skipped easily. Since a section begins with a magic number greater than 0x80000000, and a field begins with a number smaller than 0x80000000, an unknown section can be skipped by skipping all the fields it contains up to the next section. Sections should be provided in order if possible. Table A.1 gives the assigned section numbers.

Section            Magic number
Header             0xFF0000F0 or 0xF00000FF
Start registers    0x80000001
Start memory       0x80000003
Info               0x80000010
ASM                0x80000011
Code               0x80000012
End registers      0x80000002
End memory         0x80000004
End of file        0xFFFFFFFF
Table A.1: Sections and magic numbers
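The skipping rule described above can be illustrated with a short Python sketch. It is not taken from the thesis: the function name `iter_sections`, the helper `field`, the unknown section number 0x80000077 and the sample payloads are all hypothetical, and the sketch assumes that it receives only the bytes following the header, with the byte order already known.

```python
import struct

END_OF_FILE = 0xFFFFFFFF
SECTION_FLAG = 0x80000000  # MSB set: section number; MSB clear: field identifier


def iter_sections(body, byteorder=">"):
    """Yield (section_number, fields) for each section found in `body`.

    `body` holds the bytes that follow the header (an assumption of this
    sketch); `fields` is a list of (identifier, payload) tuples.
    """
    pos = 0
    while pos < len(body):
        (section,) = struct.unpack_from(byteorder + "I", body, pos)
        pos += 4
        if section == END_OF_FILE:
            return
        if not section & SECTION_FLAG:
            raise ValueError(f"expected a section number at offset {pos - 4}")
        fields = []
        # Fields run until the next 32-bit word whose MSB is set.
        while pos < len(body):
            (ident,) = struct.unpack_from(byteorder + "I", body, pos)
            if ident & SECTION_FLAG:
                break
            (length,) = struct.unpack_from(byteorder + "I", body, pos + 4)
            start = pos + 8  # skip the 8 mandatory bytes
            fields.append((ident, body[start:start + length]))
            pos = start + length
        yield section, fields


def field(ident, payload, byteorder=">"):
    """Build one field: identifier, payload length, payload (hypothetical helper)."""
    return struct.pack(byteorder + "II", ident, len(payload)) + payload


# One known section, one unknown section (0x80000077 is made up), end marker.
body = (
    struct.pack(">I", 0x80000001) + field(0x00010003, b"\x00" * 8)
    + struct.pack(">I", 0x80000077) + field(1, b"abc") + field(2, b"")
    + struct.pack(">I", END_OF_FILE)
)
known = {0x80000001}
kept = [(s, f) for s, f in iter_sections(body) if s in known]
```

A reader that only understands the start-register section simply ignores every section number it does not recognise; the length stored in each field is enough to step over the unknown data.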
Header

This section is composed of three 32-bit values:

- The magic number, which also serves as the endianness detector: the bytes 0xFF 0x00 0x00 0xF0 for big endian, 0xF0 0x00 0x00 0xFF for little endian. All subsequent values are encoded in the selected byte order.
- Architecture: the two highest bytes are the architecture, the two lowest are the sub-version (IA-64 Itanium = 0x1A64 0x0000).
- Software version (0x10000 is the first draft version).

Start/End registers

This is a dump of register values, at the beginning and at the end of the instruction run respectively. A field is encoded as follows:

- 4 bytes for the register number, which is architecture dependent (and whose MSB is 0). The two highest bytes define the register bank, while the two lowest define the register number in the bank.
- 4 bytes for the length of the data.
- length bytes for the content.

Start/End memory

This is a dump of several memory areas of various sizes. A field is encoded as follows:

- 4 bytes to indicate a memory page number (if needed).
- 4 bytes for the size of the data (including information such as physical address, size of the area, encoding mode, etc.). The data encoding is implementation specific.

Info

This is per-executed-instruction information. Each info field is identified by the number of the instruction that generated it during the run. There may be more than one info field per instruction. The fields are sorted by increasing instruction number. The format of a field is the following:

- 4 bytes to indicate the instruction number in the running sequence.
- 4 bytes for the size of the data.

ASM

This is the per-executed-instruction disassembled code. The field encoding is defined as:

- 4 bytes to indicate the instruction number in the running sequence.
- 4 bytes for the size of the string. The string should end with a 0x00 byte for C-like languages.

Code

This is the per-executed-instruction binary code. This section can be used to provide the code to execute instead of using a memory area. The format is defined as usual.
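As an illustration of the header and of the register identifier layout described in this appendix, the following Python sketch detects the byte order from the magic number and splits a register number into its bank and index. The function names are hypothetical, the sketch assumes that the three header values are stored back to back with no field framing, and the bank value used in the example is made up, since bank numbering is architecture dependent.

```python
import struct


def parse_header(data):
    """Return (byteorder, architecture, sub_version, software_version).

    Assumes the header is three consecutive 32-bit values starting at
    offset 0, the first being the magic number.
    """
    magic = data[:4]
    if magic == b"\xFF\x00\x00\xF0":
        byteorder = ">"  # big endian
    elif magic == b"\xF0\x00\x00\xFF":
        byteorder = "<"  # little endian
    else:
        raise ValueError("not a state file: bad magic number")
    arch_word, version = struct.unpack_from(byteorder + "II", data, 4)
    # Two highest bytes: architecture; two lowest bytes: sub-version.
    return byteorder, arch_word >> 16, arch_word & 0xFFFF, version


def split_register_id(ident):
    """Split a register field identifier into (bank, number_in_bank)."""
    return ident >> 16, ident & 0xFFFF


# A little-endian header for IA-64 Itanium, first draft software version.
header_le = b"\xF0\x00\x00\xFF" + struct.pack("<II", 0x1A640000, 0x10000)
byteorder, arch, sub_version, version = parse_header(header_le)

# Hypothetical register identifier: bank 2, register 31 in that bank.
bank, number = split_register_id(0x0002001F)
```

Comparing the raw magic bytes, rather than an already decoded integer, keeps the detection independent of the byte order of the machine that reads the file.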
Bibliography
[BAS94] S. Bashford, The MIMOLA Language, version 4.1, September 1994
[COO93] Todd A. Cook, Paul D. Franzon, Ed A. Harcourt, Thomas K. Miller III, System-Level Specification of Instruction Sets, 1993
[COO94] Todd A. Cook, Ed Harcourt, A Functional Specification Language for Instruction Set Architectures, January 1994
[DUE68] J. R. Duley, D. L. Dietmeyer, A Digital System Design Language (DDL), IEEE Transactions on Computers, C-17 (9), 1968, p. 850
[FAU95] A. Fauth, J. Van Praet, M. Freericks, Describing Instruction Set Processors Using nML, March 1995
[IA64-v1] Intel Corp., IA-64 Architecture Software Developer's Manual, Volume 1: IA-64 Application Architecture, Revision 1.1, July 2000
[IA64-v2] Intel Corp., IA-64 Architecture Software Developer's Manual, Volume 2: IA-64 System Architecture, Revision 1.1, July 2000
[IA64-v3] Intel Corp., IA-64 Architecture Software Developer's Manual, Volume 3: IA-64 Instruction Set Reference, Revision 1.1, July 2000
[IA64-v4] Intel Corp., IA-64 Architecture Software Developer's Manual, Volume 4: Itanium Processor Programmer's Guide, Revision 1.1, July 2000
[IA64-Errata] Intel Corp., IA-64 Architecture Software Developer's Manual, Specification Update, Revision 3.0, December 2000
[IA64-Asm] Intel Corp., Itanium Architecture Assembly Language Reference Guide, October 2000
[IA64-FP] Intel Corp., Itanium Processor Floating-point Software Assistance and Floating-point Exception Handling, January 2000
[IEEE-std] The Institute of Electrical and Electronics Engineers, Inc., IEEE Standard for Binary Floating-Point Arithmetic, IEEE Std 754-1985, 1985, 1990
[IEEE-tut] Sun Microsystems, Inc., Numerical Computation Guide, http://docs.sun.com/htmlcoll/coll.648.2/iso-8859-1/NUMCOMPGD/ncgTOC.html
[LAR97] Fredrik Larsson, Generating Efficient Simulators from a Specification Language, January 1997
[PDF99] Adobe Systems Incorporated, Portable Document Format Reference Manual, version 1.3, March 11, 1999
[PEE99] Stefan Pees, Andreas Hoffmann, Vojin Živojnović, Heinrich Meyr, LISA Machine Description Language for Cycle-Accurate Models of Programmable DSP Architectures, 1999
[RAJ98] V. Rajesh, A Generic Approach to Performance Modeling and Its Application to Simulator Generator, August 1998
[RAM97a] Norman Ramsey, Mary F. Fernández, Specifying Representations of Machine Instructions, May 1997
[RAM97b] Norman Ramsey, Mary Fernández, Automatic Checking of Instruction Specifications, 1997
[SHA86] Moe Shahdad, An overview of VHDL language and technology
[SIE82] Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, 1982, ISBN 0-07-057302-6
[Simics UG] Virtutech AB, Simics User Guide, April 2001
[ZIV96] Vojin Živojnović, Stefan Pees, Heinrich Meyr, LISA Machine Description Language and Generic Machine Model for HW/SW Co-Design, October 1996
[1] Sim-nML homepage, http://www.cse.iitk.ac.in/sim-nml/index.cgi
[2] Verilog FAQ, http://www.angelfire.com/in/verilogfaq/
[3] ghostscript homepage, http://www.cs.wisc.edu/ghost/
[4] xpdf homepage, http://www.foolabs.com/xpdf/
[5] indent homepage, http://www.gnu.org/software/indent/indent.html
[6] softfloat homepage, http://www.cs.berkeley.edu/~jhauser/arithmetic/softfloat.html
