Level 6: The User Level
o Program execution and user interface level.
o The level with which we are most familiar.
o Composed of application programs such as word processors, Paint, etc.
o The implementation of the application is hidden completely from the user.

Level 5: High-Level Language Level
o The level with which we interact when we write programs in languages such as C, Pascal, Lisp, and Java.
o Allows users to write their own applications in languages such as C, Java, and many more.
o High-level languages are easier to read, write, and maintain.
o A user at this level sees very little of the lower levels.

Level 4: Assembly Language Level
o Acts upon assembly language produced from Level 5, as well as instructions programmed directly at this level.
o The lowest human-readable form before dealing with 1s and 0s (machine language).
o An assembler converts assembly to machine language.

Level 3: System Software Level
o Controls executing processes on the system.
o Protects system resources.
o Assembly language instructions often pass through Level 3 without modification.
o Operating system software supervises other programs: it controls the execution of multiple programs and protects system resources, e.g. memory and I/O devices.
o Other utilities: compilers, interpreters, linkers, libraries, etc.
o This software can be written in both assembly and high-level languages; high-level code is much more portable, i.e. easier to modify to work on other machines.

Level 2: Machine Level
o Also known as the Instruction Set Architecture (ISA) Level.
o Consists of instructions that are particular to the architecture of the machine.
o Programs written in machine language (0s and 1s) need no compilers, interpreters, or assemblers.

Level 1: Control Level
o A control unit decodes and executes instructions and moves data through the system.
o Control units can be microprogrammed or hardwired.
o A microprogram is a program written in a low-level language that is implemented directly by the hardware.
o Hardwired control units consist of hardware that directly executes machine instructions.
o This level describes the detailed organization of a processor implementation: how the control unit interprets machine instructions (from the fetch through the execute stages).
o There can be different implementations of a single ISA.
o In the book this level is called the Control Level.

Level 0: Digital Logic Level
o This level is where we find digital circuits (the chips).
o Digital circuits consist of gates and wires.
o These components implement the mathematical logic of all other levels.
o At this level we view physical devices as just switches (on/off): instead of describing their physical behavior (i.e. in terms of voltages and currents), we use two-valued logic, i.e. 0 (off) and 1 (on).
o We will briefly look at the physical electronic components, mainly transistor technology.

Multi-level organization: summary
o Computers are designed as a series of levels.
o Each level represents a different abstraction (hence a different language).
o The bottom level is the actual computer and its (real) machine language (a low-level language).
o The top level is for high-level languages (C, C++, Java, Prolog), easier for the final user.
o The set of data types and operations of each level is called an architecture; choosing data types and operations for each level is a fundamental part of computer architecture design.
Levels of Representation
[Figure: a high-level language statement such as `temp = v[k];` is translated by a compiler into an assembly language program (e.g. MIPS `lw` and `sw` instructions), which an assembler then translates into a machine language program of binary words such as `0000 1010 1100 0101`.]
Microarchitecture
Microarchitecture is the term used to describe the resources and methods used to achieve the architecture specification. The term typically includes the way in which these resources are organized, as well as the design techniques used in the processor to reach the target cost and performance goals. The microarchitecture essentially forms a specification for the logical implementation.
In computer engineering, microarchitecture (sometimes abbreviated to µarch or uarch) is the way a given instruction set architecture (ISA) is implemented on a processor. A given ISA may be implemented with different microarchitectures.[1] Implementations may vary due to different goals of a given design or due to shifts in technology.[2] Computer architecture is the combination of microarchitecture and instruction set design.
Complicating this simple-looking series of steps is the fact that the memory hierarchy, which includes caching, main memory, and nonvolatile storage like hard disks (where the program instructions and data reside), has always been slower than the processor itself. Step (2) often introduces a lengthy (in CPU terms) delay while the data arrives over the computer bus. A considerable amount of research has been put into designs that avoid these delays as much as possible. Over the years, a central goal was to execute more instructions in parallel, thus increasing the effective execution speed of a program. These efforts introduced complicated logic and circuit structures. Initially these techniques could only be implemented on expensive mainframes or supercomputers due to the amount of circuitry needed. As semiconductor manufacturing progressed, more and more of these techniques could be implemented on a single semiconductor chip. What follows is a survey of microarchitectural techniques that are common in modern CPUs.
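The cost of these memory delays can be sketched with the standard average-memory-access-time formula; the cycle counts below are illustrative assumptions, not measurements of any real machine.

```python
# Average memory access time (AMAT) shows why the memory hierarchy matters:
# a miss to main memory costs far more than a cache hit.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * miss_penalty

# Assumed numbers: 1-cycle cache hit, 100-cycle main-memory penalty.
fast = amat(hit_time=1, miss_rate=0.02, miss_penalty=100)
slow = amat(hit_time=1, miss_rate=0.20, miss_penalty=100)
print(fast, slow)  # 3.0 21.0
```

Even a modest rise in miss rate multiplies the average delay, which is why so much microarchitectural effort goes into hiding or avoiding memory accesses.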
Microarchitectural design must trade off considerations such as: chip area/cost, power consumption, logic complexity, ease of connectivity, manufacturability, ease of debugging, and testability.
Architectures that deal with data parallelism include SIMD and vector designs. Some labels used to denote classes of CPU architectures are not particularly descriptive, especially so the CISC label: many early designs retroactively denoted "CISC" are in fact significantly simpler than modern RISC processors (in several respects). However, the choice of instruction set architecture may greatly affect the complexity of implementing high-performance devices. The prominent strategy used to develop the first RISC processors was to simplify instructions to a minimum of individual semantic complexity, combined with high encoding regularity and simplicity. Such uniform instructions were easily fetched, decoded, and executed in a pipelined fashion, and this simple strategy reduced the number of logic levels in order to reach high operating frequencies; instruction cache memories compensated for the higher operating frequency and inherently low code density, while large register sets were used to factor out as much of the (slow) memory accesses as possible.
RISC designs make pipelines smaller and much easier to construct by cleanly separating each stage of the instruction process and making them all take the same amount of time: one cycle. The processor as a whole operates in an assembly line fashion, with instructions coming in one side and results out the other. Due to the reduced complexity of the classic RISC pipeline, the pipelined core and an instruction cache could be placed on the same size die that would otherwise fit the core alone in a CISC design. This was the real reason that RISC was faster. Early designs like the SPARC and MIPS often ran over 10 times as fast as Intel and Motorola CISC solutions at the same clock speed and price. Pipelines are by no means limited to RISC designs. By 1986 the top-of-the-line VAX implementation (VAX 8800) was a heavily pipelined design, slightly predating the first commercial MIPS and SPARC designs. Most modern CPUs (even embedded CPUs) are now pipelined, and microcoded CPUs with no pipelining are seen only in the most area-constrained embedded processors. Large CISC machines, from the VAX 8800 to the modern Pentium 4 and Athlon, are implemented with both microcode and pipelines. Improvements in pipelining and caching are the two major microarchitectural advances that have enabled processor performance to keep pace with the circuit technology on which processors are based.
One of the first, and most powerful, techniques to improve performance is the use of the instruction pipeline. Early processor designs would carry out all of the steps above for one instruction before moving on to the next. Large portions of the circuitry were left idle at any one step; for instance, the instruction decoding circuitry would be idle during execution, and so on. Pipelines improve performance by allowing a number of instructions to work their way through the processor at the same time. In the same basic example, the processor would start to decode a new instruction while the previous one was still waiting for results. This would allow up to four instructions to be "in flight" at one time, making the processor look four times as fast. Although any one instruction takes just as long to complete (there are still four steps), the CPU as a whole "retires" instructions much faster and can be run at a much higher clock speed.
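The arithmetic behind "in flight" instructions can be sketched in a few lines, assuming an ideal pipeline with no stalls:

```python
# Toy model of pipelining: with S stages and N instructions, a
# non-pipelined processor needs S*N cycles, while an ideal pipeline
# needs S + (N - 1) cycles (fill the pipe once, then retire one
# instruction per cycle).

def cycles_unpipelined(n_instructions, n_stages):
    return n_stages * n_instructions

def cycles_pipelined(n_instructions, n_stages):
    return n_stages + (n_instructions - 1)

n, s = 100, 4
print(cycles_unpipelined(n, s))  # 400
print(cycles_pipelined(n, s))    # 103
```

As the instruction count grows, the speedup approaches the number of stages (here 4x), which is exactly the "look four times as fast" effect described above.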
Cache
It was not long before improvements in chip manufacturing allowed for even more circuitry to be placed on the die, and designers started looking for ways to use it. One of the most common was to add an ever-increasing amount of cache memory on-die. Cache is simply very fast memory that can be accessed in a few cycles, as opposed to the "many" needed to talk to main memory. The CPU includes a cache controller which automates reading and writing from the cache: if the data is already in the cache it simply "appears," whereas if it is not, the processor is "stalled" while the cache controller reads it in.
RISC designs started adding cache in the mid-to-late 1980s, often only 4 KB in total. This number grew over time, and typical CPUs now have at least 512 KB, while more powerful CPUs come with 1, 2, or even 4, 6, 8 or 12 MB, organized in multiple levels of a memory hierarchy. Generally speaking, more cache means more performance, thanks to reduced stalling. Caches and pipelines were a perfect match for each other. Previously, it didn't make much sense to build a pipeline that could run faster than the access latency of off-chip memory. Using on-chip cache memory instead meant that a pipeline could run at the speed of the cache access latency, a much smaller length of time. This allowed the operating frequencies of processors to increase at a much faster rate than that of off-chip memory.
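The cache controller's choice between "appears" and "stalls" can be sketched with a minimal direct-mapped cache; the sizes and addresses here are made up for illustration, and real caches are set-associative and far larger:

```python
# Minimal sketch of a direct-mapped cache: each memory block maps to
# exactly one cache line, identified by a tag.

class DirectMappedCache:
    def __init__(self, n_lines=4, block_size=1):
        self.n_lines = n_lines
        self.block_size = block_size
        self.tags = [None] * n_lines   # tag stored per line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.block_size
        line = block % self.n_lines
        tag = block // self.n_lines
        if self.tags[line] == tag:
            self.hits += 1          # data "simply appears"
        else:
            self.misses += 1        # processor would stall here
            self.tags[line] = tag   # fill the line from memory

cache = DirectMappedCache(n_lines=4)
for addr in [0, 1, 2, 3, 0, 1, 4, 0]:
    cache.access(addr)
print(cache.hits, cache.misses)  # 2 6
```

Note how address 4 evicts address 0's line, so the final access to 0 misses again: locality of reference is what makes real caches effective.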
Superscalar
Even with all of the added complexity and gates needed to support the concepts outlined above, improvements in semiconductor manufacturing soon allowed even more logic gates to be used. In the outline above the processor processes parts of a single instruction at a time. Computer programs could be executed faster if multiple instructions were processed simultaneously. This is what superscalar processors achieve, by replicating functional units such as ALUs. The replication of functional units was only made possible when the die area of a single-issue processor no longer stretched the limits of what could be reliably manufactured. By the late 1980s, superscalar designs started to enter the marketplace. In modern designs it is common to find two load units, one store unit (many instructions have no results to store), two or more integer math units, two or more floating point units, and often a SIMD unit of some sort. The instruction issue logic grows in complexity by reading in a huge list of instructions from memory and handing them off to the different execution units that are idle at that point. The results are then collected and reordered at the end.
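The issue logic described above can be sketched as a toy in-order scheduler; the 2-wide issue width and the register names are assumptions for illustration only:

```python
# Sketch of superscalar issue: each cycle, issue up to `width`
# instructions in order, but an instruction that reads a register
# written by an instruction issued in the same cycle must wait.

def schedule(instructions, width=2):
    """instructions: list of (dest, src1, src2) register names.
    Returns the number of cycles needed to issue them all, in order."""
    cycles = 0
    i = 0
    while i < len(instructions):
        issued_dests = set()
        slots = 0
        while i < len(instructions) and slots < width:
            dest, src1, src2 = instructions[i]
            if src1 in issued_dests or src2 in issued_dests:
                break  # depends on a result produced this cycle
            issued_dests.add(dest)
            slots += 1
            i += 1
        cycles += 1
    return cycles

# Independent instructions pair up; dependent ones serialize.
independent = [("r1", "a", "b"), ("r2", "c", "d"),
               ("r3", "e", "f"), ("r4", "g", "h")]
dependent   = [("r1", "a", "b"), ("r2", "r1", "c"),
               ("r3", "r2", "d"), ("r4", "r3", "e")]
print(schedule(independent))  # 2
print(schedule(dependent))    # 4
```

The same four instructions take 2 or 4 cycles depending on their data dependences, which is why superscalar speedups depend so heavily on finding independent instructions.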
Computer architects have become stymied by the growing mismatch between CPU operating frequencies and DRAM access times. None of the techniques that exploited instruction-level parallelism within one program could make up for the long stalls that occurred when data had to be fetched from main memory. Additionally, the large transistor counts and high operating frequencies needed for the more advanced ILP techniques required power dissipation levels that could no longer be cheaply cooled. For these reasons, newer generations of computers have started to exploit higher levels of parallelism that exist outside of a single program or program thread. This trend is sometimes known as throughput computing. This idea originated in the mainframe market, where online transaction processing emphasized not just the execution speed of one transaction, but the capacity to deal with massive numbers of transactions. With transaction-based applications such as network routing and web-site serving greatly increasing in the last decade, the computer industry has re-emphasized capacity and throughput issues. One technique for achieving this parallelism is multiprocessing: computer systems with multiple CPUs. Once reserved for high-end mainframes and supercomputers, small-scale (2-8 CPU) multiprocessor servers have become commonplace in the small business market. For large corporations, large-scale (16-256 CPU) multiprocessors are common. Even personal computers with multiple CPUs have appeared since the 1990s. With further transistor size reductions made available by semiconductor technology advances, multicore CPUs have appeared, where multiple CPUs are implemented on the same silicon chip. Such chips were initially used in designs targeting embedded markets, where simpler and smaller CPUs would allow multiple instantiations to fit on one piece of silicon. By 2005, semiconductor technology allowed CMP chips with two high-end desktop CPUs to be manufactured in volume.
Some designs, such as Sun Microsystems' UltraSPARC T1, have reverted to simpler (scalar, in-order) designs in order to fit more processors on one piece of silicon.
Register renaming refers to a technique used to avoid unnecessarily serialized execution of program instructions caused by the reuse of the same registers by those instructions. Suppose two groups of instructions use the same register: the first group must finish with the register before the second can use it. But if the second group is assigned a different (but otherwise identical) register, both groups of instructions can be executed in parallel.
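The idea can be sketched in a few lines; the register names and the instruction format are hypothetical simplifications:

```python
# Sketch of register renaming: each write to an architectural register is
# mapped to a fresh physical register, so two instruction groups that
# merely reuse the same architectural name no longer conflict.

def rename(instructions):
    """instructions: list of (dest, src1, src2) architectural registers.
    Returns the list with physical register numbers substituted."""
    mapping = {}            # architectural name -> physical number
    next_phys = 0
    renamed = []
    for dest, src1, src2 in instructions:
        p1 = mapping.get(src1, src1)  # sources use the current mapping
        p2 = mapping.get(src2, src2)
        mapping[dest] = next_phys     # each write gets a fresh register
        renamed.append((next_phys, p1, p2))
        next_phys += 1
    return renamed

# Both groups write r1; after renaming they write different physical
# registers (0 and 1) and could execute in parallel.
code = [("r1", "a", "b"),   # group 1: r1 = a + b
        ("r1", "c", "d")]   # group 2: r1 = c + d (reuses the name only)
print(rename(code))  # [(0, 'a', 'b'), (1, 'c', 'd')]
```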
Another technique that has become more popular recently is multithreading. In multithreading, when the processor has to fetch data from slow system memory, instead of stalling for the data to arrive, the processor switches to another program or program thread which is ready to execute. Though this does not speed up a particular program/thread, it increases the overall system throughput by reducing the time the CPU is idle. Conceptually, multithreading is equivalent to a context switch at the operating system level. The difference is that a multithreaded CPU can do a thread switch in one CPU cycle instead of the hundreds or thousands of CPU cycles a context switch normally requires. This is achieved by replicating the state hardware (such as the register file and program counter) for each active thread.
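A back-of-envelope model shows when thread switching fully hides a memory stall; all the cycle counts below are illustrative assumptions, not figures for any real CPU:

```python
# Model: each thread computes for `work` cycles, then stalls for `stall`
# cycles waiting on memory. A multithreaded CPU switches to another ready
# thread in `switch_cost` cycles instead of idling.

def stall_hidden(work, stall, n_threads, switch_cost=1):
    """True if the other threads' work (plus switch overhead) covers one
    thread's memory stall, so the pipeline never goes idle."""
    return (n_threads - 1) * (work + switch_cost) >= stall

# With a 100-cycle memory stall and 25 cycles of work per thread,
# five threads hide the latency; two do not.
print(stall_hidden(work=25, stall=100, n_threads=5))  # True
print(stall_hidden(work=25, stall=100, n_threads=2))  # False
```

This is why throughput-oriented designs like the UltraSPARC T1 run many threads per core: no single thread gets faster, but the core stays busy.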
Of all the above, the most distinguishing factor is the first. The 3 most common types of ISA are:
1. Stack - The operands are implicitly on top of the stack.
2. Accumulator - One operand is implicitly the accumulator.
3. General Purpose Register (GPR) - All operands are explicitly mentioned; they are either registers or memory locations.
Consider how the statement C = A + B would be coded in all 3 architectures:

Stack:
PUSH A
PUSH B
ADD
POP C

Accumulator:
LOAD A
ADD B
STORE C

GPR:
LOAD R1,A
ADD R1,B
STORE C,R1
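The three styles can be checked with tiny interpreters; the mnemonics and the memory model (a dictionary of named variables) are hypothetical simplifications:

```python
# Tiny interpreters contrasting the three ISA styles on C = A + B.
mem = {"A": 2, "B": 3}

# Stack machine: operands are implicitly on top of the stack.
stack = []
for op, arg in [("PUSH", "A"), ("PUSH", "B"), ("ADD", None), ("POP", "C")]:
    if op == "PUSH":
        stack.append(mem[arg])
    elif op == "ADD":
        stack.append(stack.pop() + stack.pop())
    elif op == "POP":
        mem[arg] = stack.pop()

# Accumulator machine: one operand is implicitly the accumulator.
acc = 0
for op, arg in [("LOAD", "A"), ("ADD", "B"), ("STORE", "C2")]:
    if op == "LOAD":
        acc = mem[arg]
    elif op == "ADD":
        acc += mem[arg]
    elif op == "STORE":
        mem[arg] = acc

# GPR machine: all operands are named explicitly.
regs = {}
for op, dst, src in [("LOAD", "R1", "A"), ("ADD", "R1", "B"),
                     ("STORE", "C3", "R1")]:
    if op == "LOAD":
        regs[dst] = mem[src]
    elif op == "ADD":
        regs[dst] += mem[src]
    elif op == "STORE":
        mem[dst] = regs[src]

print(mem["C"], mem["C2"], mem["C3"])  # 5 5 5
```

All three compute the same sum; what differs is how many operands each instruction must name, which drives instruction length and hardware complexity.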
Not all processors can be neatly placed into one of the above categories. The i8086 has many instructions that use implicit operands although it has a general register set. The i8051 is another example: it has 4 banks of GPRs, but most instructions must have the A register as one of their operands. What are the advantages and disadvantages of each of these approaches?
Stack
Advantages: Simple model of expression evaluation (reverse Polish). Short instructions.
Disadvantages: A stack can't be randomly accessed, which makes it hard to generate efficient code. The stack itself is accessed on every operation and becomes a bottleneck.
Accumulator
Advantages: Short instructions. Disadvantages: The accumulator is only temporary storage, so memory traffic is the highest for this approach.
2. Since the number of cycles it takes to access memory varies, so does the time for the whole instruction. This isn't good for compiler writers, pipelining, or multiple issue. 3. Most ALU instructions had only 2 operands, where one of the operands is also the destination. This means that operand is destroyed during the operation, or it must be saved somewhere beforehand.
GPR
Advantages: Makes code generation easy. Data can be stored for long periods in registers. Disadvantages: All operands must be named, leading to longer instructions. Earlier CPUs were of the first 2 types, but in the last 15 years all CPUs made are GPR processors. The 2 major reasons are that registers are faster than memory (the more data that can be kept internally in the CPU, the faster the program will run) and that registers are easier for a compiler to use.
Thus in the early 80's the idea of RISC was introduced. The SPARC project was started at Berkeley and the MIPS project at Stanford. RISC stands for Reduced Instruction Set Computer. The ISA is composed of instructions that all have exactly the same size, usually 32 bits. Thus they can be pre-fetched and pipelined successfully. All ALU instructions have 3 operands which are only registers. The only memory access is through explicit LOAD/STORE instructions. Thus C = A + B will be assembled as:
LOAD R1,A
LOAD R2,B
ADD R3,R1,R2
STORE C,R3
Although it takes 4 instructions, we can reuse the values in the registers. Why is this architecture called RISC? What is reduced about it? The answer is that to make all instructions the same length, the number of bits used for the opcode is reduced. Thus fewer instructions are provided. The instructions that were thrown out are the less important string and BCD (binary-coded decimal) operations. In fact, now that memory access is restricted, there aren't several kinds of MOV or ADD instructions. Thus the older architecture is called CISC (Complex Instruction Set Computer). RISC architectures are also called LOAD/STORE architectures. The number of registers in RISC is usually 32 or more. The first RISC CPU, the MIPS R2000, has 32 GPRs, as opposed to 16 in the 68xxx architecture and 8 in the 80x86 architecture. The only disadvantage of RISC is its code size: usually more instructions are needed, and there is waste in short instructions (POP, PUSH).
So why are there still CISC CPUs being developed? Why is Intel spending time and money to manufacture the Pentium II and the Pentium III? The answer is simple: backward compatibility. The IBM compatible PC is the most common computer in the world. Intel wanted a CPU that would run all the applications that are in the hands of more than 100 million users. On the other hand, Motorola, which builds the 68xxx series used in the Macintosh, made the transition and, together with IBM and Apple, built the PowerPC (PPC), a RISC CPU which is installed in the new Power Macs. As of now Intel and the PC manufacturers are making more money, but with Microsoft playing in the RISC field as well (Windows NT runs on Compaq's Alpha) and with the promise of Java, the future of CISC isn't clear at all. An important lesson that can be learnt here is that superior technology is a factor in the computer industry, but so are marketing and price (if not more so). An instruction set is a list of all the instructions, and all their variations, that a processor (or, in the case of a virtual machine, an interpreter) can execute.
Instruction set architecture is distinguished from the microarchitecture, which is the set of processor design techniques used to implement the instruction set. Computers with different microarchitectures can share a common instruction set. For example, the Intel Pentium and the AMD Athlon implement nearly identical versions of the x86 instruction set, but have radically different internal designs. This concept can be extended to unique ISAs like TIMI (Technology-Independent Machine Interface) present in the IBM System/38 and IBM AS/400. TIMI is an ISA that is implemented as low-level software and functionally resembles what is now referred to as a virtual machine. It was designed to increase the longevity of the platform and applications written for it, allowing the entire platform to be moved to very different hardware without having to modify any software except that which comprises TIMI itself. This allowed IBM to move the AS/400 platform from an older CISC architecture to the newer POWER architecture without having to recompile any parts of the OS or software associated with it. Nowadays there are several open source operating systems which can easily be ported to any existing general-purpose CPU, because compilation from source is an essential part of their design (e.g. during new software installation).
Machine language
Machine language is built up from discrete statements or instructions. Depending on the processor architecture, a given instruction may specify:
Arithmetic instructions such as add and subtract
Logic instructions such as and, or, and not
Data instructions such as move, input, output, load, and store
Control flow instructions such as goto, if ... goto, call, and return
An instruction set, or instruction set architecture (ISA), is the part of the computer architecture related to programming, including the native data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O. An ISA includes a specification of the set of opcodes (machine language), the native commands implemented by a particular processor.
Particular registers for arithmetic, addressing, or control functions
Particular memory locations or offsets
Particular addressing modes used to interpret the operands
More complex operations are built up by combining these simple instructions, which (in a von Neumann machine) are executed sequentially, or as otherwise directed by control flow instructions. Some operations available in most instruction sets include:
Moving:
o set a register (a temporary "scratchpad" location in the CPU itself) to a fixed constant value
o move data from a memory location to a register, or vice versa. This is done to obtain data to perform a computation on later, or to store the result of a computation.
o read and write data from hardware devices

Computing:
o add, subtract, multiply, or divide the values of two registers, placing the result in a register
o perform bitwise operations, taking the conjunction/disjunction (and/or) of corresponding bits in a pair of registers, or the negation (not) of each bit in a register
o compare two values in registers (for example, to see if one is less, or if they are equal)

Affecting program flow:
o jump to another location in the program and execute instructions there
o jump to another location if a certain condition holds
o jump to another location, but save the location of the next instruction as a point to return to (a call)
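The three groups above can be exercised with a minimal interpreter sketch; all the mnemonics and the memory model are made up for illustration:

```python
# Minimal interpreter covering the three groups: moving data,
# computing, and affecting program flow.

def run(program, memory):
    regs = {}
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "SET":          # moving: register <- constant
            regs[args[0]] = args[1]
        elif op == "STORE":      # moving: memory <- register
            memory[args[1]] = regs[args[0]]
        elif op == "ADD":        # computing
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "JLT":        # program flow: jump if less-than
            if regs[args[0]] < regs[args[1]]:
                pc = args[2]
    return memory

# Sum 0 + 1 + ... + 4 into memory cell "total".
program = [
    ("SET", "i", 0),
    ("SET", "n", 5),
    ("SET", "one", 1),
    ("SET", "sum", 0),
    ("ADD", "sum", "sum", "i"),   # instruction index 4: loop body
    ("ADD", "i", "i", "one"),
    ("JLT", "i", "n", 4),         # back to index 4 while i < n
    ("STORE", "sum", "total"),
]
print(run(program, {}))  # {'total': 10}
```

Even this tiny machine shows why control flow instructions are essential: without the conditional jump, every iteration would have to be written out by hand.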
Some architectures also provide instructions that combine an ALU operation with an operand from memory rather than a register.
A complex instruction type that has become particularly popular recently is the SIMD (Single Instruction stream, Multiple Data stream) operation, or vector instruction: an operation that performs the same arithmetic operation on multiple pieces of data at the same time. SIMD instructions can manipulate large vectors and matrices in minimal time, and they allow easy parallelization of algorithms commonly involved in sound, image, and video processing. Various SIMD implementations have been brought to market under trade names such as MMX, 3DNow! and AltiVec. The design of instruction sets is a complex issue. There were two stages in the history of the microprocessor. The first was the CISC (Complex Instruction Set Computer), which had many different instructions. In the 1970s, however, places like IBM did research and found that many instructions in the set could be eliminated. The result was the RISC (Reduced Instruction Set Computer), an architecture which uses a smaller set of instructions. A simpler instruction set may offer the potential for higher speeds, reduced processor size, and reduced power consumption. However, a more complex set may optimize common operations, improve memory/cache efficiency, or simplify programming.
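SIMD semantics can be illustrated conceptually; here the lanes are simulated with a Python loop, whereas real SIMD hardware executes all lanes in parallel with one instruction:

```python
# Conceptual illustration of SIMD: one "instruction" applies the same
# arithmetic operation across every element of a vector at once.

def simd_add(vec_a, vec_b):
    """Lane-wise add, the SIMD analogue of a scalar ADD."""
    return [a + b for a, b in zip(vec_a, vec_b)]

# E.g. brightening 8 pixels with a single vector operation instead of
# 8 scalar adds - the pattern behind MMX/3DNow!/AltiVec image kernels.
pixels = [10, 20, 30, 40, 50, 60, 70, 80]
print(simd_add(pixels, [5] * 8))  # [15, 25, 35, 45, 55, 65, 75, 85]
```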
Some computers include "complex" instructions in their instruction set. A single "complex" instruction does something that may take many instructions on other computers. Such instructions are typified by instructions that take multiple steps, control multiple functional units, or otherwise appear on a larger scale than the bulk of simple instructions implemented by the given processor. Some examples of "complex" instructions include:
saving many registers on the stack at once
moving large blocks of memory
complex and/or floating-point arithmetic (sine, cosine, square root, etc.)
performing an atomic test-and-set instruction
A control unit must decode and sequence each instruction of an ISA on this physical microarchitecture. There are two basic ways to build a control unit to implement this (although many designs use middle ways or compromises):
1. Early computer designs and some of the simpler RISC computers "hard-wired" the complete instruction set decoding and sequencing (just like the rest of the microarchitecture).
2. Other designs employ microcode routines and/or tables to do this, typically as on-chip ROMs and/or PLAs (although separate RAMs have been used historically).
There are also some new CPU designs which compile the instruction set to a writable RAM or FLASH inside the CPU (such as the Rekursiv processor and the Imsys Cjip)[1], or an FPGA (reconfigurable computing). The Western Digital MCP-1600 is an older example, using a dedicated, separate ROM for microcode. An ISA can also be emulated in software by an interpreter. Naturally, due to the interpretation overhead, this is slower than directly running programs on the emulated hardware, unless the hardware running the emulator is an order of magnitude faster. Today, it is common practice for vendors of new ISAs or microarchitectures to make software emulators available to software developers before the hardware implementation is ready.

Often the details of the implementation have a strong influence on the particular instructions selected for the instruction set. For example, many implementations of the instruction pipeline only allow a single memory load or memory store per instruction, leading to a load-store architecture (RISC). For another example, some early ways of implementing the instruction pipeline led to a delay slot. The demands of high-speed digital signal processing have pushed in the opposite direction, forcing instructions to be implemented in a particular way. For example, in order to perform digital filters fast enough, the MAC instruction in a typical digital signal processor (DSP) must be implemented using a kind of Harvard architecture that can fetch an instruction and two data words simultaneously, and it requires a single-cycle multiply-accumulate multiplier.
A typical CISC instruction may combine an ALU operation with two or three operands (including the result) directly in memory, or may be able to perform functions such as automatic pointer increment, etc. Software-implemented instruction sets may have even more complex and powerful instructions. Reduced instruction set computers, RISC, were first widely implemented during a period of rapidly growing memory subsystems; they sacrifice code density in order to simplify implementation circuitry and thereby try to increase performance via higher clock frequencies and more registers. RISC instructions typically perform only a single operation, such as an "add" of registers or a "load" from a memory location into a register; they also normally use a fixed instruction width, whereas a typical CISC instruction set has many instructions shorter than this fixed length. Fixed-width instructions are less complicated to handle than variable-width instructions for several reasons (not having to check whether an instruction straddles a cache line or virtual memory page boundary[2], for instance), and are therefore somewhat easier to optimize for speed. However, as RISC computers normally require more and often longer instructions to implement a given task, they inherently make less optimal use of bus bandwidth and cache memories. Minimal instruction set computers (MISC) are a form of stack machine, where there are few separate instructions (16-64), so that multiple instructions can be fit into a single machine word. These types of cores often take little silicon to implement, so they can be easily realized in an FPGA or in a multi-core form. Code density is similar to RISC; the increased instruction density is offset by requiring more of the primitive instructions to do a task. There has been research into executable compression as a mechanism for improving code density. The mathematics of Kolmogorov complexity describes the challenges and limits of this.
Instruction sets may be classified by the maximum number of operands explicitly specified in instructions. This does not refer to the arity of the operators, but to the number of operands explicitly specified as part of the instruction. Thus, implicit operands stored in a special-purpose register or on top of the stack are not counted. (In the examples that follow, a, b, and c refer to memory addresses, and reg1 and so on refer to machine registers.)
0-operand ("zero-address machines") - these are also called stack machines, and all operations take place using the top one or two positions on the stack. Adding two numbers: #a, load, #b, load, add, #c, store.

1-operand ("one-address machines") - often called accumulator machines - include most early computers. Each instruction performs its operation using a single operand specifier. The single accumulator register is implicit (source, destination, or often both) in almost every instruction: load a; add b; store c.

2-operand - many RISC machines fall into this category, though many CISC machines do as well. For a RISC machine (requiring explicit memory loads), the instructions would be: load a,reg1; load b,reg2; add reg1,reg2; store reg2,c.

3-operand CISC - some CISC machines fall into this category. The above example might be performed in a single instruction in a machine with memory operands: add a,b,c; or more typically (most machines permit a maximum of two memory operations even in three-operand instructions): move a,reg1; add reg1,b,c.

3-operand RISC - most RISC machines fall into this category, because it allows "better reuse of data"[2]. In a typical three-operand RISC machine, all three operands must be registers, so explicit load/store instructions are needed. An instruction set with 32 registers requires 15 bits to encode three register operands, so this scheme is typically limited to instruction sets with 32-bit instructions or longer. Example: load a,reg1; load b,reg2; add reg1,reg2,reg3; store reg3,c.
More operands - some CISC machines permit a variety of addressing modes that allow more than 3 operands (registers or memory accesses), such as the VAX "POLY" polynomial evaluation instruction.
Transputer
UNIVAC 1100/2200 series
x86
o IA-32 (i386, Pentium, Athlon)
o x86-64 (64-bit superset of IA-32)
EISC (AE32K)
p-Code (UCSD p-System Version III on Western Digital Pascal MicroEngine)
Java virtual machine (ARM Jazelle, picoJava, JOP)
FORTH
Alpha
ARM
Burroughs B5000/B6000/B7000 series
IA-64 (Itanium)
Mico32
MIPS
Motorola 68k
PA-RISC
IBM 700/7000 series
o System/360
o System/370
o System/390
o z/Architecture
Power Architecture
o POWER
o PowerPC
PDP-11
o VAX
SPARC
SuperH
Tricore
ISAC (Instruction Set Architecture C) - a "C" back-end outputs "C" code instead of assembly language. Very few examples exist; among them are ISAC and Lissom. The LLVM Compiler Infrastructure implements a C and C++ back-end.
ALGOL object code
SECD machine, a virtual machine used for some functional programming languages
MMIX, a teaching machine used in Donald Knuth's The Art of Computer Programming
Z-machine, a virtual machine originated by Infocom
It is unfortunate that the term MIPS is used as a processor benchmark as well as a shorthand form of a company name, so first I had better make the distinction clear. The company responsible for the CPU designs in the N64 is MTI, an abbreviation for MIPS Technologies Inc., but many people (including myself) refer to the company as just MIPS.

MIPS (the benchmark): The processor benchmark called MIPS has nothing to do with the company name. In the context of CPU performance measurement, MIPS stands for 'Million Instructions Per Second' and is probably the most useless benchmark ever invented. The rest of this page concerns MIPS as a benchmark, not the company (also discussed here are the MFLOPS and SPEC benchmarks, plus a comment on memory bandwidth).

The MIPS rating of a CPU refers to how many low-level machine code instructions a processor can execute in one second. Unfortunately, using this number as a way of measuring processor performance is completely pointless because no two chips use exactly the same kind of instructions, execution method, etc. For example: on one chip, a single instruction may do many things when executed (CISC = Complex Instruction Set Computing), whereas on another chip a single instruction may do very little but is dealt with more efficiently (RISC = Reduced Instruction Set Computing). Also, different instructions on the same chip often do vastly different amounts of work (eg. a simple arithmetic instruction might take just 1 clock cycle to complete, whereas something like floating point division or a square root operation might take 20 to 50 clock cycles). People who design processors, and people like me who are interested in how they work, almost never use a processor's 'MIPS' rating when discussing performance because it's effectively meaningless (like many people, I at one time thought that a CPU's MIPS rating was all-important; ironically, an employee at MIPS Technologies Inc. corrected my faulty beliefs when I asked him about the performance of the R8000).
MIPS numbers are often very high because of how processors work, but in fact the number tells one absolutely nothing about what the processor can actually do or how it works (ie. a processor with a lower MIPS rating may actually be a better chip because its instructions do more work per clock cycle). There are dozens of different processor and system benchmarks, such as SPEC, Linpack, MFLOPS, STREAM, Viewperf, etc. One should always use the test that is most relevant to one's area of interest and the system concerned. With games consoles, however, this is a bit of a problem because no one has yet made a 'games console' benchmark test - people have to use existing benchmarks which were never designed for the job.

An example: imagine a 32bit processor running at 400MHz. It might be rated at 400MIPS. Now consider a 64bit processor running at 200MHz. It might be rated at 200MIPS (assume a simple design in each case). But suppose my task involves 64bit fp processing (eg. computational fluid dynamics, or audio processing, etc.): the 32bit processor would take many more clock cycles to complete a single 64bit fp multiply since its registers are only of a 32bit word length. The 32bit CPU would take at least twice as long to carry out such an operation. Thus, for 64bit operations, the 32bit processor would be much slower than the 64bit processor. Now think of it the other way round: suppose one's task only involved 32bit operations. Unless the 64bit registers in the 64bit CPU could be treated as two 32bit registers, the 32bit CPU would be much faster. It all depends on the processing requirements.

The situation in real life is far more complicated though, because real CPUs rarely do one thing at a time and in just one clock cycle. Simple arithmetic operations may take 1 cycle, an integer multiply might take 2 cycles, a fp multiply might take 5 clock cycles, a complex square root operation in a CISC design might take 20 cycles, and so on.
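The 400MHz/32bit vs 200MHz/64bit example can be put into numbers. The sketch below assumes (purely for illustration) that the 32bit chip needs 4 instructions to emulate one 64bit multiply while the 64bit chip needs just 1; the point is that the chip with the higher MIPS rating finishes last.

```python
# Hedged arithmetic sketch of the MIPS example above.
# Assumption (illustrative, not measured): the 32bit CPU needs 4
# instructions per 64bit fp multiply; the 64bit CPU needs only 1.

def seconds_for_multiplies(n_multiplies, mips, instrs_per_multiply):
    """Time to retire n 64bit multiplies at a given MIPS rating."""
    total_instructions = n_multiplies * instrs_per_multiply
    return total_instructions / (mips * 1_000_000)

n = 10_000_000  # ten million 64bit fp multiplies
t32 = seconds_for_multiplies(n, mips=400, instrs_per_multiply=4)
t64 = seconds_for_multiplies(n, mips=200, instrs_per_multiply=1)
print(t32, t64)  # 0.1 s vs 0.05 s -- the "slower" 200 MIPS chip wins
```

The 400 MIPS chip takes twice as long for this workload, which is exactly why a raw MIPS figure says nothing without knowing how much work each instruction does.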
Worse, some CPUs are designed to do more than one of the same kind of operation at once, ie. they have more than one of a particular kind of processing unit. CPUs such as SGI's R10000 series (or later equivalents), the HP PA8000 series, the old Alpha 21x64 series, etc. often have 2 or more integer processing units, multiple fp processing units and at least one load/store unit.
Sometimes, they may have special units too, for example to accelerate square root calculations. But it doesn't stop there! Today, there are technologies such as MMX (from Intel), which is designed to allow a 64bit integer register to be treated as multiple 32bit, 16bit or 8bit integer registers, and also MDMX (from MIPS Technologies Inc.), which does the same but is more powerful in that it also allows the same register splitting to be done with fp registers and includes a 192bit accumulator register, although at present SGI hasn't implemented MDMX in any of their available CPUs. These new ideas enable many more calculations to be performed in the same amount of time compared to older designs. An example: Gouraud shading involves 32bit fp operations; using a 64bit fp register as two 'separate' 32bit fp registers will (at best) double the processing ability of the CPU. So that's the MIPS benchmark dealt with, ie. it's useless, so ignore it. Since I mentioned fp calculations, that leads nicely onto the MFLOPS benchmark.

MFLOPS

People often use MFLOPS to mean different things, but a general definition would be the number of full word-size fp multiply operations that can be performed per second (the M stands for 'Million'). Obviously, fp add or subtract operations take less time, and slowest of all is fp divide. Older CPUs take many clock cycles to complete one FLOP and so, even at high clock speeds, their FLOP rate can be low. An example is the 486DX4/100, which is rated at about 6MFLOPS. Compare this to the 200MHz R4400, which is rated at about 35MFLOPS. For older processors, clock speed is clearly no indication of MFLOP rate. Newer designs don't make things any clearer - if anything the situation is more complex, since it is often the reverse: CPUs like the R10000 can do two fp operations each clock cycle, giving a rating of 400MFLOPS at 200MHz. The R8000 is even more confusing since it has two fp execution units, each capable of doing two fp
ops/clock, giving it a rating of 360MFLOPS at 90MHz! (That's ten times faster than an Intel P90.) Again, the nature of the task is important. A 64bit CPU that can do 400MFLOPS may be fine, but if one's work only needs 32bit processing then much of the CPU's capability is being wasted. CPUs like the R5000 address this problem, aiming at markets that do not need 64bit floating point (fp for short) processing. Future designs like MDMX will solve the wastage problem, but they will also make the measuring of CPU performance even harder. Perhaps CPU capability is a better metric, but no one has devised such a test yet. There is just a wide variety of benchmarks, and one must use the most appropriate test as a basis for decision making.

All this talk of MFLOPS is fine, but it misses one very important point: memory bandwidth. A fast CPU may sound impressive, and PR people will always talk in terms of theoretical peak performance, etc., but in reality a CPU's best possible performance totally depends on the rate at which it can access data from the various kinds of memory (L1 / L2 cache and main RAM). A fast CPU in a system with low memory bandwidth will not perform anywhere near its theoretical peak (eg. the 500MHz Alpha). I have studied the effect of this on the 195MHz R10000 and the results are very interesting. If you want to know more about the whole issue of memory bandwidth, then see the STREAM Memory Bandwidth site. What is important here with regard to the N64 is that SGI have given it a very high memory bandwidth indeed (500MB/sec peak, ie. almost 4 times faster than PCI). The N64's memory design uses Rambus technology, which is also used in SGI's IMPACT graphics technology.

SPEC

I've always found it funny (or annoying, depending on my mood at the time :) that gamers try and use the SPEC benchmark when arguing about the performance of main controller CPUs in games consoles. What gamers should understand is that SPEC's main CPU test suite (currently
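The peak MFLOPS figures quoted above follow directly from clock speed and the number of fp operations a chip can issue per cycle, which is why the 90MHz R8000 can out-rate a faster-clocked chip:

```python
# Peak MFLOPS = clock (MHz) x fp operations issued per cycle.
# These are theoretical peaks, not sustained measurements.

def peak_mflops(clock_mhz, fp_ops_per_cycle):
    return clock_mhz * fp_ops_per_cycle

print(peak_mflops(200, 2))      # R10000: 2 fp ops/cycle at 200MHz -> 400
print(peak_mflops(90, 2 * 2))   # R8000: two fp units x 2 ops/clock -> 360
```

Note how clock speed alone predicts nothing: the 90MHz part reaches 90% of the 200MHz part's peak because it issues twice as many fp operations per cycle.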
SPEC2000) was never designed for games consoles; SPEC is not a graphics test, or a test designed to measure the preprocessing tasks often needed for 3D graphics, such as intensive MADD (multiply-add) operations, although some individual tests in the test suite may be similar. SPEC (92, 95 or 2000) is an integer/floating point test package consisting of a number of separate integer and fp tests which are supposed to be run on systems with at least 32MB (SPEC92) or 64MB (SPEC95) of RAM (perhaps more for SPEC2000). Thus, it's fine to discuss the theoretical SPEC performance of a CPU like the R4300i, but in the context of a games console it's completely meaningless. This is why you won't find my N64 tech info site using the theoretical SPEC numbers for the R4300i in my own discussions: they are useless when talking about the N64's real capabilities. While I'm on the subject, please note the following:
One cannot convert SPEC92 results to SPEC95 results in any kind of linear manner; the same applies to SPEC2000. All SPEC test suites consist of a weighted average of many different tests, and as such any wide variance in individual test results can be seriously hidden (as is the case with the R5000). SPEC focuses more on scientific and engineering applications - such tasks are not like 3D graphics and so are not helpful when judging CPU performance for games (many SPEC tests involve 64bit operations, whereas geometry and lighting calculations in a games console will usually be 32bit operations).
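The point about a composite score hiding wide variance can be seen with a toy calculation. SPEC-style suites summarize many subtest ratios with a single mean (the numbers below are invented for illustration, not real SPEC results):

```python
# Toy illustration: a composite mean can mask one very poor subtest.
# The subtest ratios here are invented; they are not real SPEC data.
import math

def geo_mean(ratios):
    """Geometric mean, as used by SPEC-style composite scores."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

balanced = [10.0] * 8            # uniformly good on all eight subtests
lopsided = [12.0] * 7 + [2.0]    # one subtest collapses badly
print(round(geo_mean(balanced), 1))   # 10.0
print(round(geo_mean(lopsided), 1))   # about 9.6 -- looks almost the same
```

The two composites differ by only a few percent even though one machine is six times slower on a whole subtest, which is exactly the kind of variance the text warns is "seriously hidden".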
Operating System:
An operating system, or OS, is a software program that enables the computer hardware to communicate and operate with the computer software. Without an operating system, a computer would be useless.

Operating system types

As computers have progressed and developed, so have the types of operating systems. Below is a basic list of the different types of operating systems and a few examples of operating systems that fall into each category. Many computer operating systems fall into more than one of the categories below.

GUI - Short for Graphical User Interface, a GUI operating system contains graphics and icons and is commonly navigated using a computer mouse. Examples: System 7.x, Windows 98, Windows CE.

Multi-user - A multi-user operating system allows multiple users to use the same computer at the same time and/or at different times. Examples: Linux, Unix, Windows 2000.

Multiprocessing - An operating system capable of supporting and utilizing more than one computer processor. Examples: Linux, Unix, Windows 2000.

Multitasking - An operating system that is capable of allowing multiple software processes to run at the same time. Examples: Unix, Windows 2000.
Multithreading - Operating systems that allow different parts of a software program to run concurrently. Examples: Linux, Unix, Windows 2000.
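Multithreading as described above - several parts of one program running concurrently under OS scheduling - can be sketched with Python's standard threading module (the worker function and thread count here are arbitrary illustrations):

```python
# Minimal multithreading sketch: four threads of one program run
# concurrently, scheduled by the operating system.
import threading

results = []
lock = threading.Lock()

def worker(name, count):
    total = sum(range(count))   # some independent computation
    with lock:                  # serialize access to the shared list
        results.append((name, total))

threads = [threading.Thread(target=worker, args=(f"t{i}", 1000))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                    # wait for every part to finish
print(len(results))             # 4 -- all four threads completed
```

The lock matters: because the threads share the program's memory, the OS may interleave them at any point, so access to shared state must be synchronized.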
Following are the five services provided by an operating system for the convenience of its users.
file and sees his or her task accomplished. Thus the operating system makes it easier for user programs to accomplish their tasks. This service involves secondary storage management. The speed of I/O that depends on secondary storage management is critical to the speed of many programs, and hence I think it is best relegated to the operating system rather than giving individual users control of it. It is not difficult for user-level programs to provide these services, but for the above-mentioned reasons it is best if this service is left with the operating system.
Communications
There are instances where processes need to communicate with each other to exchange information. It may be between processes running on the same computer or on different computers. By providing this service, the operating system relieves the user of the worry of passing messages between processes. In cases where messages need to be passed to processes on other computers through a network, it can be done by user programs. The user program may be customized to the specifics of the hardware through which the message transits, and provides the service interface to the operating system.
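A minimal sketch of this service for two processes on the same computer: the parent passes a message to a separate child process through OS-managed pipes (the child's uppercasing job is an arbitrary illustration).

```python
# Interprocess communication sketch: the OS carries a message between
# two processes via pipes (the child's stdin/stdout), using only the
# standard 'subprocess' module.
import subprocess
import sys

# The child is a second Python process that uppercases whatever it reads.
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"

result = subprocess.run([sys.executable, "-c", child_code],
                        input="hello",       # sent through an OS pipe
                        capture_output=True, # reply comes back the same way
                        text=True)
print(result.stdout)  # HELLO
```

Neither process touches the other's memory; the operating system moves the bytes between them, which is exactly the message-passing service described above.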
Program Execution
The purpose of a computer system is to allow the user to execute programs, so the operating system provides an environment where the user can conveniently run them. The user does not have to worry about memory allocation or multitasking; these things are taken care of by the operating system. Running a program involves allocating and deallocating memory, and CPU scheduling in the case of multiprocessing. These functions cannot be given to user-level programs, so user-level programs cannot help the user run programs independently, without help from the operating system.
I/O Operations
Each program requires input and produces output. This involves the use of I/O. The operating system hides from the user the details of the underlying I/O hardware; all the user sees is that the I/O has been performed, without any details. So the operating system, by providing I/O, makes it convenient for users to run programs. For efficiency and protection, users cannot control I/O directly, so this service cannot be provided by user-level programs.
Error Detection
An error in one part of the system may cause malfunctioning of the complete system. To avoid such a situation, the operating system constantly monitors the system to detect errors. This relieves the user of the worry of errors propagating to various parts of the system and causing malfunctions. This service cannot be handed to user programs because it involves monitoring, and in some cases altering, areas of memory, deallocating memory for a faulty process, or perhaps relinquishing the CPU from a process that goes into an infinite loop. These tasks are too critical to be handed over to user programs: a user program, if given these privileges, could interfere with the correct (normal) operation of the operating system.
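The infinite-loop case mentioned above can be sketched concretely: a supervisor relies on the OS to forcibly reclaim the CPU from a runaway child process. The one-second timeout and the deliberately looping child are illustrative choices, not a real system policy.

```python
# Error-handling sketch: the OS lets a supervisor kill a child process
# stuck in an infinite loop, reclaiming the CPU. Uses the standard
# 'subprocess' module; the 1-second timeout is an arbitrary choice.
import subprocess
import sys

# A child process that deliberately never terminates.
looping = [sys.executable, "-c", "while True: pass"]

try:
    subprocess.run(looping, timeout=1)   # supervisor waits at most 1s
    outcome = "finished"
except subprocess.TimeoutExpired:
    outcome = "killed"                   # runaway process was terminated
print(outcome)  # killed
```

The crucial point is that the faulty program cannot prevent its own termination: only the operating system has the privilege to stop a process from outside, which is why this service cannot live in user-level code.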
A computer system is an electronic device which can input, process, and output data. Input data of a computer may represent numbers, words, pictures etc. Programs that control the operations of the computer are stored inside the computer.

Systems Software
System software manages computer resources and makes computers easier to use. Systems software can be divided into three categories:
1. Operating System (OS). Examples: Windows, UNIX and Macintosh.
2. System support software. Examples: disk-formatting and anti-virus programs.
3. System development software. Example: language translators.

Applications Software
An applications software enables a computer user to do a particular task. Example applications software include: word processors, game programs, spreadsheets (e.g. Excel), database systems, graphics programs, and multimedia applications.

Computer Hardware
I/O (Input/Output)Devices
Input devices are used to enter programs and data into a computer. Examples: keyboard, mouse, microphone, and scanner. Output devices are where program output is shown or is sent. Examples: monitor, printer, and speaker. I/O devices are slow compared to the speed of the processor.
Computer memory is faster than I/O devices: the speed of input from memory to the processor is acceptable.

Computer Memory
The main function of computer memory is to store software. Computer memory is divided into primary memory and secondary memory.

Primary Memory
Primary memory is divided into random access memory (RAM) and read-only memory (ROM): the CPU can read from and write to RAM, but it can only read from ROM, not write to it. RAM is volatile while ROM is not.

Secondary Memory
Examples of secondary memory devices are: hard disks, floppy disks and CD-ROMs.
The CPU The CPU is the "brain" of the computer system. The CPU directly or indirectly controls all the other components. 21