Lecture Notes
COMPS266F
Computer Architecture
2017 Presentation
Copyright © Andrew Kwok-Fai Lui 2017
The computer architecture part of the course begins with the problem of how to
design a programmable computer. We will first explain what a programmable
computer means.
Computers, or computer systems, do not necessarily refer to the computer you have on
your desktop. A computer is any device that can calculate.
The word computing originally meant calculating. As with many other English words, the
meaning of computing has changed with the development of society and technology.
• In the modern era, computing may refer to anything from the day-to-day operations of
financial institutions, to creating documents and spreadsheets, programming, data mining,
statistical analysis, and controlling a spaceship. Modern computers are electronic devices
that you can purchase from a computer shop. They allow you to play computer games,
write documents, carry out financial planning, and talk to friends across the world.
• In the old days, computers were often mechanical devices. The abacus is an example of a
computer that was great for simple arithmetic. Wilhelm Schickard
invented a digital mechanical calculator in 1623 that used metal gears and levers; he
is known as the father of the computing era. Mechanical computers remained
common until the 1930s and 1940s, when electronic devices began to be used to build
computing systems. The Electronic Numerical Integrator and Computer (ENIAC), one of
the first electronic computers, was designed for ballistic calculation using over 17,000
vacuum tubes. ENIAC was specially built for this particular purpose, but it could be
partially re-purposed by rewiring.
References:
http://en.wikipedia.org/wiki/History_of_computer
About the history of the development of computer systems. Read the story about how the
British invented the first modern computer but were forbidden to reveal it because
it was a wartime secret.
The following shows a typical computing process. The process reads input data, processes
the data, and then writes output data. The process has access to data storage, which can be
used to store data for future use. The stored data can later be fed back to influence the process.
2. Programmable computers
The following gives examples of programmable systems from low programmability to high
programmability.
• A toaster with a time knob
• A washing machine with programs for various types of clothing.
• A DVD recorder supporting various recording modes.
• A programmable calculator supporting programmed sequences of calculation steps.
• An Excel spreadsheet supporting functions and macros.
• A modern general purpose computer system
3. Components of a Programmable Computer
A programmable computer must perform several processes. The major processes are listed below:
• instruction execution: an essential function of the programmable computer
• data storage: a function for storing the data before and after the instruction execution
• program storage: a function for storing the program in the programmable computer for
instruction execution
• inputting data and program: a function for data and program to go into the computer from
the outside world
• outputting data: a function for data to leave the computer to the outside world
The last two processes are essential because a computer cannot exist in isolation. A computer
useful for any purpose must be able to interact with the outside world.
These processes are refined and their roles are abstracted into the following major
components for a programmable computer.
• Arithmetic and Logic Unit (ALU): for instruction execution
• Memory system: for data and program storage
• Input: for data input into the programmable computer
• Output: for data output from the programmable computer
4. Introduction to Arithmetic and Logic Unit
The first component of a programmable computer is the Arithmetic and Logic Unit (ALU).
The ALU is a functional unit responsible for the execution of instructions.
• The execution of instructions is an essential function of a programmable computer.
• The input to the ALU includes the data and the instructions that command how to deal
with the data.
• The result of the instruction execution will appear at the output of the ALU.
The following figure shows a schematic diagram of the ALU with its input and output. One input
channel is for sending in instructions and the others are for sending in data. The ALU can
typically execute many types of instructions, for example, add, subtract, and negation.
In the figure, the ALU has two data input channels and one data output channel. This is a
typical arrangement because most operations (instructions) have at most two operands.
• Addition: A + B. A and B are passed into the two input channels and the result of A+B
will appear at the output channel.
• Subtraction: A – B. The same case as addition.
• Negation: -A. This is a single operand operation. A is passed into one input channel.
The operation of the ALU is controlled by the instruction. For example, an Addition
instruction will make the ALU perform an addition operation on the input data. The ALU
will also output data to inform other components of its status. For example, if an error occurs in
the calculation, then an error status may be emitted.
Different instructions are represented by different electronic signals, which in turn
represent numbers. The ALU designer may specify that 01 represents Addition and 02
represents Subtraction. The coding of instructions is usually published in a technical
manual.
The ALU does not carry out any operation unless it is told to do so. The clock line connected
to the ALU sends a regular signal to the ALU, in a way similar to an alarm's "beep beep beep"
sound. Upon receiving a beep, the ALU executes one instruction. It then executes
another instruction when the second beep arrives.
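The behaviour described above can be sketched as a minimal software model of the ALU (not a real circuit). The instruction codes follow the example of 01 for Addition and 02 for Subtraction; the code 03 for Negation is an assumption for illustration.

```python
# A toy model of the ALU: one call corresponds to one clock "beep",
# during which exactly one instruction is executed.

def alu(opcode, a, b=None):
    """Execute one instruction on the operand(s) and return the result."""
    if opcode == 0x01:      # Addition: two operands
        return a + b
    elif opcode == 0x02:    # Subtraction: two operands
        return a - b
    elif opcode == 0x03:    # Negation: single operand (assumed code)
        return -a
    raise ValueError("unknown instruction code")

print(alu(0x01, 7, 5))   # 12
print(alu(0x02, 7, 5))   # 2
print(alu(0x03, 7))      # -7
```

An unknown code raises an error, mirroring the idea that the ALU emits an error status when something goes wrong.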
5. Memory
The second component is Memory. In a programmable computer, memory is a component
supporting the following functions:
• Store data (Write)
• Retrieve data (Read)
• Overwrite previously stored data (Overwrite)
The left figure shows a schematic diagram of Memory with its input and output. There is one
data channel.
• Because one can write data to the memory as well as read data from the memory, the data
channel works in both directions. This communication pattern is known as duplex.
• The read/write line is used to tell the memory whether to Read or Write data through
the data channel.
• Similar to the ALU, Memory carries out actions as it receives signals from the clock line.
At each beep of the clock line, memory performs one data operation, whether it is a read
operation or a write operation.
• Each data operation involves a data unit. The size of data units varies from one memory
system to another.
A useful Memory should be able to store many data units. If there is more than one data
unit stored in Memory, there must be a way to identify each one.
• There is a unique address associated with each data unit stored in Memory.
• An address is usually a numeric value numbered sequentially from 0.
• The first data unit in Memory has address 0, the second unit has address 1, and so on.
• The number of addresses is equal to the overall size of Memory.
• The address line is used to specify an address for the current operation.
In the discussion so far, the word data has not been explained: whether it is a number, a
word, or something else. This is a data representation problem, and the topic will be discussed
later.
Example: Memory operations and the clock rate
Question: Consider that the clock signal to the Memory occurs 2000 times per second, and the
Memory has 64 addresses of data units of storage. Calculate the amount of time required to
write data once to all data units.
Answer: Each write operation requires 1/2000 seconds. There are 64 write operations
required to write data to all the data units. The total time required is (1/2000) * 64 = 0.032 seconds.
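The Memory component and the timing calculation above can be sketched as a toy model; the 64 addresses and 2000 Hz clock are taken from the example.

```python
# A toy model of Memory: 64 addressable data units, a read/write
# control, and a clock where each operation takes one tick.

CLOCK_HZ = 2000                 # clock beeps per second (from the example)
SIZE = 64                       # number of addresses, 0 to 63

memory = [0] * SIZE             # one data unit per address

def access(address, write=False, data=None):
    """Perform one memory operation (one clock tick): read or write."""
    if write:
        memory[address] = data  # Write: store the data unit
        return None
    return memory[address]      # Read: retrieve the data unit

# Writing once to all 64 data units takes 64 ticks:
seconds = SIZE / CLOCK_HZ
print(seconds)                  # 0.032, i.e. (1/2000) * 64 seconds
```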
6. Input and Output
The final components are Input and Output. These two components connect the
programmable computer and the outside world. The following shows a schematic diagram of
the two components.
There is one important point about Input and Output: the situation at these two components
is not under the computer's control. The Input and Output are connected to the outside world, which is
beyond the realm of the programmable computer.
• The designers of the programmable computer can control how the components in the
computer work. In particular, through the Clock line, the operation timing of
the Memory and the ALU can be controlled precisely.
• However, the Input may receive data at any time, irrespective of the inner workings of the
programmable computer. If data enters the Input but the programmable
computer is not ready to receive it, then the data would be lost.
To solve the problem of potential data loss, a data buffer is added to the Input component. It
is used for storing any input data temporarily, until the programmable computer is ready to
handle them.
The programmable computer may generate a lot of data and write it to the Output. It is up to
the outside world to capture the data. It is not the responsibility of the
programmable computer to ensure that the outside world sees all the data.
Sometimes a buffer is also added to the Output for higher efficiency. For example, if the Output
is connected to a remote web server, it is usually more efficient to send data in larger
batches.
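The input buffer idea above can be sketched with a simple queue: data arriving at arbitrary times is held until the computer is ready, instead of being lost. The function names are illustrative only.

```python
# A sketch of the Input component's data buffer: the outside world may
# deliver data at any time; the buffer holds it until the computer reads.

from collections import deque

input_buffer = deque()

def outside_world_sends(data):
    """May be called at any time, outside the computer's control."""
    input_buffer.append(data)           # buffered instead of lost

def computer_reads():
    """Called only when the computer is ready to handle input."""
    if input_buffer:
        return input_buffer.popleft()   # oldest data first
    return None                         # nothing waiting

outside_world_sends("a")
outside_world_sends("b")    # arrives before the computer is ready
print(computer_reads())     # a
print(computer_reads())     # b
```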
7. Summary
This chapter will focus on how to design the Arithmetic and Logic Unit for integer addition
and subtraction. This is the simplest ALU that can handle these two instructions.
1. Overview
The Arithmetic and Logic Unit (ALU) is a component that can execute arithmetic and logic
instructions. The most basic ALUs execute a few of the simplest arithmetic and logic
instructions. More sophisticated ALUs can execute very complex operations, and they
may contain a number of ALUs within them.
We will begin with the simplest type of ALU and consider first the problem of designing an
ALU that supports addition and subtraction of integers.
2. Requirements of a Good ALU Design
The functionality of the ALU includes addition and subtraction of integers. We do not,
however, just want any ALU; we want a useful ALU for our programmable computer. There are
additional desirable requirements of a good ALU design.
• High Reliability and Error Free. The ALU operations should be highly reliable. It is
useless to have an ALU that sometimes churns out an erroneous result.
• Simple Design. The ALU design should be simple, reducing the cost of producing
it and the cost of making improvements.
• Efficient Operation. The ALU should operate efficiently so that it improves the
performance of our programmable computer.
• Large Data Range. The ALU should be able to handle as large a data range as possible.
One always desires a calculator that can handle more digits.
• Economical Cost. The ALU should not be too costly to produce. Complexity and
power often come with increased cost.
The above issues are all important to our design consideration.
However, the most important lesson to learn in this course is that in the real world, not all the
above issues carry the same weight. We cannot have everything. We must be
selective.
• We have to sacrifice some of the desirable characteristics in order to achieve other
desirable characteristics.
• We must be prepared to give and take: give away the less important desires and
take the most important ones. This is known as a trade-off.
References: Trade-Off
http://en.wikipedia.org/wiki/Trade_off
3. The Main Approaches of our ALU Design
The ALU we are going to design will execute instructions and process data. We are
interested in the details of the operations, for example how addition or subtraction can occur.
• The data or numbers would not be in a written form. The addition would not be done
by pencil and paper.
• The programmable computer is an electronic device and the ALU is built using
electrical circuitry.
• Data and numbers will be coded as electrical signals and arithmetic operations will be
carried out electronically.
Our basic ALU design consists of the following main features:
• The ALU and the programmable computer will use digital representation to code
data in electrical signals.
• The ALU uses binary numeral system to code numbers for arithmetic and logic
operations.
• The ALU uses two's complement binary representation for addition and
subtraction of positive and negative numbers. Two's complement binary
representation is a variant of binary numeral system.
The above three main features help to justify that our ALU is the result of good design
decisions. We will explain these features in the following sections.
4. Digital Representation
The ALU will use digital representation to encode data. Data is an abstract entity but
eventually it must be represented somehow with a physical attribute. In electronic systems,
the common physical attribute to use is the voltage. Voltage is a continuous scalar.
There are actually two fundamental ways to represent data with voltages: analogue and
digital. Our ALU will use digital representation for greater reliability and error tolerance.
The following figure explains the analogue representation.
• Analogue representation is continuous. Any small changes in the signal can change
the original value into an incorrect value.
• Analogue representation can represent continuous data values, but it is not tolerant to
noise and other forms of signal degradation.
Digital representation represents data in discrete levels.
The following figure shows that a 2-level digital representation is a lot more error tolerant
than a 10-level digital representation.
• The levels are well defined and any sufficiently small fluctuation in the signal will
keep the signal at the same level. Digital representation is therefore less prone to
errors.
• One can decide the number of levels used in a digital representation. The error
tolerance decreases as more levels are defined.
The ALU will use the binary numeral system. The binary numeral system uses two symbols,
'0' and '1', to represent data. Therefore its implementation requires only a 2-level digital
representation, which is the most error tolerant and least technically challenging. Normally, a low
voltage represents '0' and a high voltage represents '1', but it could be the other way round. An
even more reliable method is to encode '0' as a change of voltage from low to high, and '1' as
a change from high to low.
A numeral system provides a systematic and consistent set of rules to represent numbers. The
commonly known numeral systems include decimal, binary, and hexadecimal. Some key
characteristics of numeral systems include the following:
• Each numeral system defines a set of numbers, such as integers or positive numbers.
• Each numeral system provides each number in the set with a unique
representation.
• Each numeral system contains a set of unique symbols, each representing a certain
value in the set of numbers. In the decimal numeral system, the ten symbols are 0, 1,
2, ..., 9.
• Each numeral system provides rules for combining symbols to represent a larger
range of numbers, and therefore it can support a large number set.
For example, the decimal numeral system uses positional notation to combine the
symbols to represent numbers such as 32 and 1589. Larger numbers are constructed by
putting symbols together in juxtaposition.
The base of a numeral system is the number of unique symbols used in the system.
The decimal number system is the norm in today's societies, as it was in ancient China and the Hindu-
Arabic world. However, in the ancient world, there were all sorts of number systems.
• Vigesimal, or base-20, used by the Mayans.
• Duodecimal, or base-12, used by Nigerians.
• Sexagesimal, or base-60, used by the Babylonians.
The decimal number system has 10 symbols, from 0 to 9. To represent values larger than 9,
we use positional notation. Positional notation is based on a system in which each digit is
related to the next by a multiplier, which is the base or radix of the number system. In
the decimal number system, the multiplier is 10. This means that, for any digit, the digit to its
left-hand side is worth 10 times as much.
Example: Positional Notation in Decimal System
Question: Why does the number 3456 represent the value 3456 in the decimal numeral
system?
Answer:
Digits           3             4            5           6
Representation   10^3 or 1000  10^2 or 100  10^1 or 10  10^0 or 1
Value of digit   3000          400          50          6

A second example: the octal number 2476 (octal).
Digits                      2           4          7         6
Representation              8^3 or 512  8^2 or 64  8^1 or 8  8^0 or 1
Value of digit (decimal)    1024        256        56        6
Total = 1024 + 256 + 56 + 6 = 1342 (decimal)
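The positional evaluation shown in the tables above can be sketched as a small helper function (an illustration, not part of the original notes):

```python
# Evaluating positional notation directly: each digit contributes
# digit * base**position, with position 0 at the rightmost digit.

def positional_value(digits, base):
    """Value of a number given as a list of digits, most significant first."""
    total = 0
    for position, digit in enumerate(reversed(digits)):
        total += digit * base ** position
    return total

print(positional_value([3, 4, 5, 6], 10))  # 3456 = 3000 + 400 + 50 + 6
print(positional_value([2, 4, 7, 6], 8))   # 1342 = 1024 + 256 + 56 + 6
```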
For the same number of digits, numeral systems with more symbols can represent a larger range
of values. The hexadecimal number FFFFFFFF (hex) is equivalent to the large decimal number
4,294,967,295 (decimal). The largest decimal number with 8 digits is only 99999999
(decimal).
However, each symbol must eventually be implemented as a level in the digital
representation, and more levels mean lower error tolerance.
We can increase the range of representation with another method: allowing a greater
number of digits.
The numeral system as described so far can only represent positive numbers (and zero). We must modify the
representation rules if we want to represent negative numbers. There are two possible
methods.
1. Adding a new symbol
• Adopt an additional symbol '-' to indicate that the number following it is negative.
The number -3456 means negative 3456.
• There are now 11 symbols instead of the original 10 in the decimal system.
The drawback of the additional symbol is added overall complexity; for example,
each additional symbol needs a unique level in the digital representation.
2. Use a designated digit
• Designate the leftmost position to indicate negativity. If the leftmost digit is '0',
it is a positive number. On the other hand, if the leftmost digit is '1', then it is a
negative number. So 03456 is positive 3456 and 13456 is negative 3456.
• The drawback of this approach is that an additional digit must be added to every number.
Our ALU will use the binary numeral system because it requires only a 2-level digital
representation, which is the most error tolerant.
• The basic binary numeral system uses two symbols 0 and 1 and the positional
notation to represent positive values.
• The number of digits determines the range of values that can be represented. A digit
in a binary number is called a bit.
• There are 8 bits in a byte.
• An 8-bit binary number can represent 256 different values.
• If the smallest value is 0, then the range is from 0 to 255.
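The relation between bit size and range stated in the bullets above can be checked with a few lines of code:

```python
# An n-bit positive binary representation has 2**n unique values,
# ranging from 0 up to 2**n - 1.

for bits in (1, 8, 16, 32):
    values = 2 ** bits
    print(f"{bits}-bit: 0 to {values - 1} ({values} unique values)")

# The 8-bit line confirms the bullet: 0 to 255, 256 unique values.
```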
6. Two's Complement Binary Representation for Efficient ALU Operations
The ALU uses a special type of binary representation called two's complement binary
representation. Even after we have decided on the binary numeral system, there are still
several ways to represent values with binary numbers. In order to justify our decision that
two's complement binary representation is the most suitable one, we should first review the
metrics for suitability.
• Hardware reliability. All binary representations require a 2-level digital representation,
and their hardware implementations would be equally reliable. So this factor is not
considered here when we are comparing binary representations.
• Range of representation. A representation useful for many purposes should support
both positive and negative values.
• Utilization of resource. Given the same number of digits and symbols, each
representation can represent a certain range of values. For example, an 8-bit binary
system (two symbols) has 256 unique combinations. A representation that can fully
utilize the resource can assign all 256 combinations to 256 distinct values.
• Efficiency in executing instructions. The ALU will carry out arithmetic and logic
operations. It may be very complex to implement the circuitry for operations based
on a certain representation. For example, addition with binary numbers is simpler
than addition with decimal numbers, because there are fewer combinations of
possible pairs of operand digits.
The two's complement binary representation is chosen because it has the following
advantages:
• It can represent both positive and negative values (and also zero).
• It can fully utilize resources. An 8-bit 2's complement number is mapped to 256
distinct values (from -128 to +127).
• It can support efficient addition and subtraction. Subtraction with 2's complement
numbers is a 2-step process involving a simple bit inversion and an addition operation.
The circuitry for the addition operation can be reused for the subtraction operation,
simplifying hardware design.
In the following sections we examine variants of binary representation and explain the
reasons for choosing the 2's complement binary representation.
Positive binary representation
Positive binary representation has its smallest value set at zero. The following table shows
the range afforded by various bit sizes of positive binary representation.
Bit size  Range          Maximum (Binary)                  Maximum (Decimal)  Number of Unique Values
1-bit     0 to 2^1 - 1   1                                 1                  2
8-bit     0 to 2^8 - 1   11111111                          255                256
16-bit    0 to 2^16 - 1  1111111111111111                  65535              65536
32-bit    0 to 2^32 - 1  11111111111111111111111111111111  4294967295         4294967296
Binary Coded Decimal (BCD)
Binary coded decimal is a special representation that codes each decimal digit independently
into binary. For example, the decimal number 68 (decimal) is coded in BCD as
follows. Each decimal digit requires 4 binary digits (bits) to encode.
6 -> 0110
8 -> 1000
So 68 (decimal) is equivalent to 0110 1000 (BCD)
A drawback of this approach is the range of numbers that it can represent. An 8-bit BCD
number can cover only 0 to 99 (decimal), but an 8-bit positive binary number can cover 0 to
255 (decimal). Resource utilization is low.
Addition and subtraction operations pose little problem in circuitry implementation. Each
group of 4 bits is considered together in one operation, and so the addition of two 8-bit BCD
numbers requires two addition operations. BCD addition and subtraction is therefore less
efficient than in the positive binary representation, and the circuitry is a bit more complex. An
advantage of BCD is the ease of conversion to printing and LCD display formats.
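The BCD encoding of 68 (decimal) shown above can be sketched as a small helper (an illustration only):

```python
# Binary Coded Decimal: each decimal digit is encoded independently
# as a 4-bit group.

def to_bcd(n):
    """Encode a non-negative decimal integer as a BCD bit string."""
    return " ".join(format(int(d), "04b") for d in str(n))

print(to_bcd(68))   # 0110 1000  (6 -> 0110, 8 -> 1000)
```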
Sign-Magnitude Binary Representation
In sign-magnitude binary representation, the leftmost bit indicates the sign ('0' for positive,
'1' for negative) and the remaining bits represent the magnitude.
Bit size  Minimum (Binary)     Maximum (Binary)     Range (Decimal)   Number of Unique Values
8-bit     1111 1111            0111 1111            -127 to +127      255
16-bit    1111 1111 1111 1111  0111 1111 1111 1111  -32767 to +32767  65535
The resource utilization is good, except that there are now two numbers representing the value
zero: 0000 0000 and 1000 0000. So an 8-bit sign-magnitude binary representation
can represent only 255 values (from -127 to +127).
Addition and subtraction of sign-magnitude binary numbers is more challenging. The
operations cannot be simplified into smaller operations on individual digits.
One's Complement Binary Representation
The method of complement is sometimes used in subtraction. This method turns a subtraction
operation into a complement operation and an addition operation. For example, the expression
654 - 234 is converted into a complement operation on 234 and an addition:
654 - 234 (234 is converted into 766 by subtracting it from 1000)
654 + 766 = 1420 -> 420 (The carry 1 is discarded)
In one's complement binary representation, we represent a negative number by finding the
one's complement of the corresponding positive number. The one's complement operation is
carried out by inverting every bit (turning 0 into 1 and 1 into 0).
Given the positive 8-bit binary number 0011 1000 (decimal 56), to find its corresponding
negative number (decimal -56), we apply the one's complement operation to the 8-bit binary
number, giving 1100 0111.
Each time the one's complement operation is applied, the sign of the number is reversed. This
is equivalent to a negation operation.
Note that the number 1100 0111 is in 1's complement format, and it cannot be directly
converted to decimal if it is a negative number. One's complement binary representation is
the system that uses one's complement operation to work out a positive binary number's
corresponding negative number.
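The one's complement (bit inversion) operation described above can be sketched in code; the fixed 8-bit width and the mask are details of this sketch.

```python
# One's complement negation on an 8-bit pattern: invert every bit.

BITS = 8
MASK = (1 << BITS) - 1          # 1111 1111, keeps results within 8 bits

def ones_complement(x):
    return ~x & MASK            # bit inversion confined to 8 bits

n = 0b00111000                  # decimal 56
neg = ones_complement(n)
print(format(neg, "08b"))       # 11000111, i.e. -56 in one's complement

# Applying the operation again reverses the sign back:
print(format(ones_complement(neg), "08b"))  # 00111000
```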
The following table shows the range of numbers that can be represented by 8-bit and 16-bit
one's complement binary representation.
Bit size  Minimum (Binary)     Maximum (Binary)     Range (Decimal)   Number of Unique Values
8-bit     1000 0000            0111 1111            -127 to +127      255
16-bit    1000 0000 0000 0000  0111 1111 1111 1111  -32767 to +32767  65535
Two's Complement Binary Representation
The drawback of the one's complement binary representation is a little wastage in the range of
values that it can represent. For an 8-bit binary number, from 0000 0000 to 1111 1111, there
are 256 different patterns. So 8 bits can represent 256 values if there is no wastage. The range
of an 8-bit one's complement representation is only -127 to +127, which is 255 different
values, because zero is represented twice (as 0000 0000 and 1111 1111).
The two's complement binary representation is a small modification of the one's
complement counterpart. The only difference is that two's complement operation is used
instead of one's complement operation.
Two's complement operation has two steps:
• One's complement operation (bit inversion)
• Add one
Given the positive 8-bit binary number 0011 1000 (decimal 56), to find its corresponding
negative number (decimal -56), we apply the two's complement operation to the 8-bit binary
number: inverting the bits gives 1100 0111, and adding one gives 1100 1000.
As with the one's complement representation, the two's complement operation is equivalent
to a negation operation. A negative 2's complement number cannot be converted directly to a
decimal number. Its positive absolute value should be determined first by the 2's complement
operation, and then the numeral system conversion can take place.
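The two-step operation above (invert the bits, then add one) can be sketched in code; the fixed 8-bit width is a detail of this sketch.

```python
# Two's complement negation on an 8-bit pattern: invert every bit,
# then add one, discarding any carry out of the 8 bits.

BITS = 8
MASK = (1 << BITS) - 1

def twos_complement(x):
    return (~x + 1) & MASK      # bit inversion, then add one

n = 0b00111000                  # decimal 56
neg = twos_complement(n)
print(format(neg, "08b"))       # 11001000, i.e. -56 in two's complement

# Negating again recovers the positive value:
print(format(twos_complement(neg), "08b"))  # 00111000
```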
The range of an 8-bit two's complement representation is -128 to +127, and there are 256
unique values.
Bit size  Minimum (Binary)     Maximum (Binary)     Range (Decimal)   Number of Unique Values
8-bit     1000 0000            0111 1111            -128 to +127      256
16-bit    1000 0000 0000 0000  0111 1111 1111 1111  -32768 to +32767  65536
The ALU will support at least three arithmetic operations: negation, addition, and subtraction.
Negation is easy with 2's complement representation because it is equivalent to 2's
complement operation.
Addition and subtraction actually share almost the same implementation with 2's complement
representation. A subtraction can be changed into an addition of a negated operand. For
example, A - B is equivalent to A + (-B). The negation of B is done with the 2's complement
operation, and we then apply addition to the two operands.
Normally the number of bits assigned to represent an integer is fixed for a particular ALU.
This fixed number of bits restricts the range of values that can be represented. It can also
cause problems for arithmetic operations.
Overflow errors can be easily detected in two's complement operations. The following table
summarizes the possibility of overflow in addition and subtraction operations.
Example: addition of 2's complement numbers
Question: Consider the examples of 127 + 1 and -128 - 1 in 8-bit 2's complement binary
representation.
Answer:
127 + 1
0111 1111 + 0000 0001 = 1000 0000 <= -128 in decimal
Clearly something is wrong because the addition of two positive numbers cannot result in a
negative number.
-128 - 1 => (-128) + (-1)
1000 0000 + 1111 1111 = 1 0111 1111 <= Ignore carry bit, it is 127 in decimal.
Again, something is wrong here because subtracting a positive number from a negative
number should not result in a positive number.
Overflow is a signal/condition indicating that the result has gone out of range. This is an
exception status that needs to be detected.
• Overflow is set when the result of addition or subtraction overflows into the sign bit.
• Addition of two positive integers cannot result in a negative integer; if it does, an
overflow has occurred.
• Addition of two negative integers cannot result in a positive integer; if it does, an
overflow has occurred.
Carry is another important status. Computer systems have a limit on the number of bits that
can be handled in one operation. For example, a 64-bit value may be used in a computer
system that can handle only 32-bit operations. Typically the 64-bit value is separated
into two parts, each of which is a 32-bit value suitable for operation. The carry status is set
when the result of an addition or subtraction exceeds the fixed number of bits allocated.
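The overflow and carry conditions described above can be sketched for 8-bit two's complement addition, reusing the 127 + 1 and -128 - 1 examples:

```python
# Detecting overflow and carry in 8-bit two's complement addition.

BITS = 8
MASK = (1 << BITS) - 1
SIGN = 1 << (BITS - 1)          # the sign bit, 1000 0000

def add8(a, b):
    """Add two 8-bit patterns; return (result, overflow, carry)."""
    total = a + b
    result = total & MASK
    carry = total > MASK        # a carry out of the top bit occurred
    # Overflow: both operands have the same sign, but the result's
    # sign differs, i.e. the result overflowed into the sign bit.
    overflow = (a & SIGN) == (b & SIGN) and (result & SIGN) != (a & SIGN)
    return result, overflow, carry

# 127 + 1: two positives give the pattern 1000 0000 (-128) -> overflow
print(add8(0b01111111, 0b00000001))  # (128, True, False)
# -128 + (-1): two negatives give 0111 1111 (+127) -> overflow and carry
print(add8(0b10000000, 0b11111111))  # (127, True, True)
```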
7. Summary
We now have more details on the status output of the ALU. The following figure shows the
revised ALU design. This ALU can carry out integer addition, subtraction, and negation in
2's complement binary representation.
Appendix 2A. Base Conversion
Conversion between numbers of different bases (numeral systems) can be done easily by many
modern calculators.
However, if you are required to do it by hand, then there are two approaches to choose from,
depending on the source base and the destination base.
• If the source and destination bases are powers of 2, such as binary, octal, or
hexadecimal, then conversion is easy if it goes through binary. For example, to
convert base-8 to base-16, first convert base-8 to base-2, then base-2 to base-16.
• If the bases are of other numbers, then using the decimal numeral system as the point
of interchange is efficient. For example, to convert from base-3 to base-16, the
method is to first convert from base-3 to base-10, and then base-10 to base-16. Using
base-10 as the point of interchange allows us to do most of the arithmetic in decimal,
which is something we are familiar with.
The following describes how to convert base-N to base-10, and then base-10 to base-N.
In the following we assume that positional notation is adopted in all bases and negative
values are represented with a negative sign.
A1. Conversion from base-N to base-10
Numbers in a base-N numeral system can be converted to decimal easily if they adopt the
positional notation representation.
In positional notation, the value of a symbol depends on both the symbol itself
and the position where the symbol appears in the number. The rightmost position is position
0, and the position index increases by one with each move of one digit to the left. The value
contributed by a digit equals the product of the intrinsic value of the symbol and the base
raised to the power of the position index.
Value of a digit = Intrinsic value of symbol * Base^(Position index)
Take the decimal numbers 10020 and 3100 as examples. The symbol '1' is at position index 4
in the number 10020, and so the digit is worth 1 (intrinsic value) * 10^4 = 10000 (decimal).
The symbol '1' is at position index 2 in the number 3100, and so the digit is worth 1 (intrinsic
value) * 10^2 = 100 (decimal).
Example: Convert 101010 (binary) to decimal.
Value = 32 + 8 + 2 = 42 (decimal)
Question: Convert 123 (octal) to decimal.
Answer:
Digits           1       2      3
Representation   8^2     8^1    8^0
Value of digit   1 * 64  2 * 8  3 * 1
Total = 64 + 16 + 3 = 83 (decimal)
Values in any number system can be converted to decimal using the same method.
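The base-N to base-10 method above can be sketched as a small helper function; handling the symbols A to F (for hexadecimal) is an assumption of this sketch.

```python
# Converting a base-N number, given as a digit string, to decimal by
# positional notation: each step multiplies the running value by the
# base and adds the next digit.

def base_n_to_decimal(digits, base):
    value = 0
    for symbol in digits:
        value = value * base + int(symbol, 36)  # handles 0-9 and A-F
    return value

print(base_n_to_decimal("123", 8))     # 83
print(base_n_to_decimal("101010", 2))  # 42
print(base_n_to_decimal("FF", 16))     # 255
```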
A2. Conversion from base-10 to base-N
Values in decimal numeral system can be converted into any other numeral system by the
method of division.
If a decimal number D is to be converted to base B, D is repeatedly
divided by B until the quotient is zero. The remainders become the digits of the value in
base B. For example, to convert 42 (decimal) to binary:
42 / 2 = 21 remainder 0
21 / 2 = 10 remainder 1
10 / 2 = 5 remainder 0
5 / 2 = 2 remainder 1
2 / 2 = 1 remainder 0
1 / 2 = 0 remainder 1
The binary number is read from the bottom to the top: 101010
To convert from binary to hexadecimal, binary digits are divided into groups of four, and each
group is converted into one hexadecimal digit. To convert from hexadecimal to binary, the
process is simply reversed: each hexadecimal digit is converted into 4 binary digits.
Conversion between base-2 and base-8
The process is basically the same except that binary numbers should be divided into groups of
three digits.
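The repeated-division method above can be sketched as a small helper function; the symbol set covering bases up to 16 is an assumption of this sketch.

```python
# Converting decimal to base-N by repeated division: the remainders,
# read in reverse order (bottom to top), are the digits.

def decimal_to_base_n(n, base):
    if n == 0:
        return "0"
    symbols = "0123456789ABCDEF"
    out = []
    while n > 0:
        out.append(symbols[n % base])  # each remainder becomes a digit
        n //= base
    return "".join(reversed(out))      # read bottom to top

print(decimal_to_base_n(42, 2))    # 101010
print(decimal_to_base_n(83, 8))    # 123
print(decimal_to_base_n(255, 16))  # FF
```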
Powers of 2
Some commonly used powers of 2 should be remembered
Appendix 2B. Radix Three
Both the number of symbols and the width of representation have implications.
• The number of symbols affects the coding of digital signal lines. The more symbols
there are, the more error-prone the data transmission becomes. It is more
challenging, from an engineering standpoint, to design a reliable many-symbol signal
line. (From a more human perspective, learning 2 symbols is easy; learning 10
symbols (0 to 9) is more difficult but manageable for kids; learning 100 symbols,
however, is challenging.)
• The width of representation affects the number of signal lines required. Each digit
requires one signal line, and each additional signal line adds complexity and cost to
the computer system.
We want to minimize both the number of symbols and the width of representation, while
keeping the range of representation the same. In the above example, we keep the range 0 to 99
the same to allow us to analyse the relation between the number of symbols and the width.
In mathematical terms, we want to minimize the product of the number of symbols n and the
width w, while holding the range constant. The range is given by n^w.
The optimal number of symbols (the radix) is found to be the constant e (approximately
2.718). This calculation assumes that the variables n and w are continuous rather than
discrete. The following graph shows that the minimum point is at 2.718 for different values of
n^w (n^w = 10, n^w = 100, and n^w = 1000).
For practical reasons we need an integer as the base radix. Because the integer 3 is closer to
2.718 than any other integer, base 3 (ternary) is the most efficient representation
mathematically.
We come back to the question of whether a best number system for computer systems exists.
The answer is yes, but it depends on your criterion for what constitutes the best system.
Different people may come up with different answers. Clearly, almost all modern computer
systems are based on binary. There are two possible reasons for this phenomenon:
• The first computers were based on the binary system, and so it was cost effective for
subsequent designs to follow a well-proven model. Binary computer systems
therefore captured a significant "market share".
• Technologically, the binary system is easier to implement because the engineering
techniques for binary operations were well established.
In this chapter we will refine the design of the Arithmetic and Logic Unit (ALU) to handle
fractional values. Integers are a very special type of number; real world data is continuous.
• If precision is not an important requirement, then we can use integers to approximate
such real world data.
• If higher precision is required, then floating point representation should be used.
The floating-point representation allows the representation of numbers with a particular
precision requirement. Normally the precision level of a number is equivalent to the number
of significant digits.
• Anders has assets of 1 million dollars. We are not sure whether he has exactly
$1,000,000 or $1,999,999. The number of significant digits is only one. It is not
precise.
• Betsy has assets valued at $12,540 thousand. We are not sure whether she has
$12,540,000 or $12,540,999. However, the number of significant digits is 5. It is a
more precise description of her assets.
The floating-point representation allows a very large number or a very small number to be
represented with any number of significant digits. It is based on the exponent
representation, so that the radix point can float (move forward or backward) while the
exponent is adjusted to compensate.
1. Exponential and Floating Point Representations
The floating-point representation is based on the exponent representation. In decimal, the
exponent representation is in the following form.
Sign Significant-Digits x 10^exponent
One important feature of the exponent representation is its economical format. The big
number 1620001000000000000000000.0 can be succinctly represented as 1.620001 x 10^24.
Real world data are sometimes of a very small or very large magnitude. For example, the
number 0.0000 0000 0000 0000 0001 requires 20 digits to represent. This value is actually
not very small if we compare it to some physical world values, such as the mass of an
electron or the diameter of an atom.
The sequence of zeros is useful only in indicating the position of the last digit 1. It may be
more concise to say that the number is 1 after 19 zeros to the right of the radix point. The
same goes for very large numbers such as 1234000000000000.0
If we are finding a representation for our ALU and programmable computer, then a simpler
and more economical format is an advantage.
We can convert any decimal number into exponent format by the process of normalization.
This is a process of moving the radix point of the number until the following normalized form
is achieved.
M.MMMMM… x 10^exponent
The number of M type digits depends on the number of significant digits. The M type digits
are called the magnitude part or the mantissa. The position of the radix point is between the
first and the second digit.
• Moving the radix point one digit to the left is equivalent to division by 10. So we
compensate it by adding 1 to the exponent.
• Moving the radix point one digit to the right is equivalent to multiplication by 10. So
we compensate it by subtracting 1 from the exponent.
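The normalization process can be sketched numerically in Python. This is an illustration only: it locates the radix point with math.log10 instead of moving it digit by digit, the function name is our own, and zero is not handled:

```python
import math

def normalize(x: float):
    """Normalize a non-zero decimal number into sign, M.MMM..., exponent,
    with exactly one non-zero digit before the radix point."""
    sign = "-" if x < 0 else "+"
    magnitude = abs(x)
    exponent = math.floor(math.log10(magnitude))  # how far the point must move
    mantissa = magnitude / 10 ** exponent         # shift the radix point
    return sign, mantissa, exponent

print(normalize(123000000))   # approximately ('+', 1.23, 8)
print(normalize(0.00001234))  # approximately ('+', 1.234, -5)
```

The results are subject to ordinary floating point rounding, hence "approximately".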
Exercises: Exponent Format
Question: Convert the following decimal numbers into exponent format of the normalized
form. For each of them, indicate the sign, magnitude, and exponent parts. (i) 123000000;
(ii) 123456789; (iii) -3450000; (iv) 0.00001234; (v) -0.00000823
When we are designing a representation dealing with floating point values for the ALU, then
we should start from the resource point of view. We decide that we can afford to have 8
digits to represent a floating point value. So the design issue is how to assign different roles
for the 8 digits.
Recall that the exponent format has three parts.
• The sign (indicating positive or negative)
• The exponent and the base (indicating the position of the significant digits and the base
of the numeral system)
• The mantissa (the significand, or the significant digits)
If the base is set to be 10 (decimal), the question is about assigning which digits are for the
sign, exponent, and mantissa respectively.
Consider the following format for the 8-digit number.
SEEM MMMM
• The symbol S represents the sign. The digit 0 indicates positive and 1 negative.
• The symbol E indicates the exponent digits. The possible range of EE is
from 00 to 99.
• The symbol M indicates the mantissa digits.
The above number represents the following number in exponential format.
S M.MMMM x 10^EE
The range of the exponent can be refined through a method called excess-N. If EE is in
excess-50 format, then the stored number is in excess of the actual exponent by 50; the actual
exponent it represents is the stored value less 50. For example, if EE is 01, then the actual
exponent is (01 - 50) = -49.
The excess-N notation changes the range of the exponent from 00..99 to -50..+49. The range
is now extended to negative exponents.
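A tiny sketch of excess-50 encoding and decoding (the function names are our own):

```python
EXCESS = 50

def encode_exponent(actual: int) -> int:
    """Store an exponent in excess-50 notation: stored = actual + 50."""
    assert -50 <= actual <= 49, "exponent out of representable range"
    return actual + EXCESS

def decode_exponent(stored: int) -> int:
    """Recover the actual exponent: actual = stored - 50."""
    return stored - EXCESS

print(decode_exponent(1))    # stored EE = 01 means actual exponent -49
print(encode_exponent(-49))  # -> 1
```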
Note that there is a region around 0.00000 outside the range of this representation method.
These are numbers of very small magnitude. When the representation method fails to
represent values in this range, the condition is known as underflow.
Arithmetic operations in exponential representation may be carried out using the following
procedures:
• In addition and subtraction, the exponent and the mantissa are handled separately. The
exponents must first be aligned, and then any mantissa overflow can be fixed by adjusting
the exponent.
0.12 x 10^-2 + 0.345 x 10^-4
= 0.12 x 10^-2 + 0.00345 x 10^-2
= 0.12345 x 10^-2
• In multiplication and division, the mantissas are operated on normally, while the
exponents are added or subtracted. Normalization is then required, with the excess-50
adjustment applied.
0.2 x 10^-2 x 0.4 x 10^3
= 0.08 x 10^1
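These procedures can be sketched in Python on (mantissa, exponent) pairs. The sketch is ours: it omits re-normalization and overflow handling, and the results are subject to ordinary floating point rounding:

```python
def add_exponential(m1, e1, m2, e2):
    """Add two numbers given as (mantissa, exponent) pairs in base 10.
    The smaller exponent is first aligned to the larger one."""
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 = m2 / 10 ** (e1 - e2)  # shift the smaller-exponent mantissa
    return m1 + m2, e1

def multiply_exponential(m1, e1, m2, e2):
    """Multiply: mantissas are multiplied, exponents are added."""
    return m1 * m2, e1 + e2

print(add_exponential(0.12, -2, 0.345, -4))   # approximately (0.12345, -2)
print(multiply_exponential(0.2, -2, 0.4, 3))  # approximately (0.08, 1)
```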
2. IEEE 754 Representation
The IEEE 754 Floating Point Standard is a standard for representing floating point values in
modern computers. The full name of the standard is the IEEE Standard for Binary
Floating-Point Arithmetic (ANSI/IEEE Std 754-1985). The original standard defines two
formats of different precision.
• The single-precision 32-bit binary format uses 32 bits to represent a floating-point
number using the exponential representation and excess-N notation.
• The double-precision 64-bit binary format uses 64 bits to represent a floating-point
number using the exponential representation and excess-N notation.
There is also a 128-bit binary format and two decimal formats introduced in 2008, which will
not be discussed here.
The single-precision 32-bit format divides up the 32 bits into the following.
• Sign-bit (1-bit).
• Exponent bits (8-bits) in excess-127 notation and positive binary format.
• Mantissa bits (23-bits).
The above figure shows the role of each bit in the 32-bit binary number.
The exponent is in excess-127 notation. The range of allowable exponents is from -126 to
127 only. Some exponent values are reserved.
The value NaN means Not a Number. This is a special flag often produced as the result of an
invalid operation, for example 0/0 or the square root of a negative value.
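The reserved patterns can be inspected with Python's struct module, which exposes the machine's own IEEE 754 single-precision encoding (a sketch for illustration; the helper name is our own):

```python
import struct

def float_bits(x: float) -> str:
    """Show the 32-bit IEEE 754 pattern of x as sign | exponent | mantissa."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret as uint32
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

print(float_bits(1.0))           # exponent field 01111111 = 127 (excess-127 for 0)
print(float_bits(float("inf")))  # exponent all ones, mantissa all zeros
print(float_bits(float("nan")))  # exponent all ones, mantissa non-zero
print(float_bits(0.0))           # all bits zero
```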
Conversion from decimal into IEEE 754 Numbers
The following steps allow the conversion from decimal into IEEE 754 numbers.
• Find out if it is a special value and convert it as such. The special values include zero,
infinity, and very small numbers in the denormalized range.
• Convert it as a decimal number to a binary number.
• Two sides of the decimal point (the integral part and the fractional part) should be
converted separately.
• Convert the binary number into a normalized format by adjusting the exponent and
mantissa accordingly.
• Reform the digits into IEEE 754 formats.
Conversion from IEEE 754 Numbers to decimal
The following steps allow the conversion from IEEE 754 numbers to decimal.
• Extract the three parts: sign, exponent, and mantissa.
• Find out if it is a special value and convert it as such. Check the exponent and the
mantissa parts.
• Put the digits in the normalized form if it is in normalized mode.
• Remove the exponent by shifting the radix point.
• Convert the integral part and fraction part separately into decimal.
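The decoding steps can be sketched for normalized numbers and cross-checked against Python's struct module. The bit pattern below is an arbitrary example of ours; special values are deliberately not handled:

```python
import struct

def decode_ieee754_single(bits: int) -> float:
    """Decode a 32-bit IEEE 754 pattern (normalized numbers only)."""
    sign = (bits >> 31) & 0x1                # extract the three parts
    exponent = ((bits >> 23) & 0xFF) - 127   # remove the excess-127 bias
    mantissa = bits & 0x7FFFFF
    significand = 1 + mantissa / 2 ** 23     # normalized form 1.MMM... (binary)
    return (-1) ** sign * significand * 2 ** exponent

bits = 0b0_10000001_10100000000000000000000  # sign 0, biased exponent 129
print(decode_ieee754_single(bits))           # -> 6.5

# Cross-check against the machine's own decoding of the same 4 bytes:
print(struct.unpack(">f", bits.to_bytes(4, "big"))[0])  # -> 6.5
```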
The IEEE 754 64-bit standard offers greater range and precision. It has the following format.
• Sign-bit (1-bit).
• Exponent bits (11-bits) in excess-1023 notation and positive binary format.
• Mantissa bits (52-bits).
3. Alternative Representation Methods
There are actually several alternative representation modes that we have decided against in
our ALU design. Each of them has its limitations and you should be able to analyse and
discuss these limitations.
The radix point determines the significance of digits. The digit to the immediate left of the
radix point has positional index of 0. The following table shows why the number 234.56 has
the value 234.56.
Digits           2             3            4 .         5              6
Representation   10^2 or 100   10^1 or 10   10^0 or 1   10^-1 or 0.1   10^-2 or 0.01
Value of digit   200           30           4           0.5            0.06
The same representation can be applied to binary numbers. We can place a radix point to
denote the position of the digit of positional index 0. The following table shows the value of
the binary number 1101.01, which equals 13.25 (decimal).
Digits           1          1          0          1 .        0             1
Representation   2^3 or 8   2^2 or 4   2^1 or 2   2^0 or 1   2^-1 or 0.5   2^-2 or 0.25
Value of digit   8          4          0          1          0             0.25
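The same positional rule can be sketched in Python for binary numbers with a radix point (the helper name is our own):

```python
def binary_fraction_to_decimal(number: str) -> float:
    """Convert a binary number with a radix point, e.g. '1101.01', to decimal.

    Digits left of the point have non-negative position indices (2^0, 2^1, ...);
    digits right of the point have negative indices (2^-1, 2^-2, ...).
    """
    integral, _, fractional = number.partition(".")
    value = sum(int(d) * 2 ** i for i, d in enumerate(reversed(integral)))
    value += sum(int(d) * 2 ** -(i + 1) for i, d in enumerate(fractional))
    return value

print(binary_fraction_to_decimal("1101.01"))  # -> 13.25
```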
Fixed Radix Point
One solution to the above problem is to fix the radix point at some position. For example, we
can fix the radix point between positional index 1 and 2 in an 8-bit binary number. Because
the radix point is built into the format, there is no need for another symbol.
So the 8-bit binary number 11010101 actually means 110101.01
The radix point is implied to be at the position between index 1 and 2. It determines how
many digits belong to the integral part and how many belong to the fractional part.
If this solution is used, the designer must be careful in deciding the implicit position of the
radix point. Compare the following two designs: one fixes the radix point before index 5 and
one fixes it before index 1.
Radix point     Example                 Integral   Fractional   Drawback
before index                            part       part
1               11010101 => 110101.01   6 digits   2 digits     Possible fractional values are
                                                                0.00, 0.25, 0.50 and 0.75 only
5               11010101 => 11.010101   2 digits   6 digits     The maximum value is only
                                                                3.984375
The first option suffers from a precision problem and the second option suffers from a range
problem.
We can achieve better range and precision by using a longer representation (16-bit or 32-bit).
Still care must be taken to decide the distribution of digits between the integral part and
fractional part.
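A small sketch of the fixed radix point interpretation, reproducing the two designs in the table (the helper name is our own):

```python
def decode_fixed_point(bits: str, fractional_digits: int) -> float:
    """Interpret a raw binary string as a fixed-point value, with the radix
    point implicitly fixed before the given number of fractional digits."""
    return int(bits, 2) / 2 ** fractional_digits

print(decode_fixed_point("11010101", 2))  # 110101.01 -> 53.25
print(decode_fixed_point("11010101", 6))  # 11.010101 -> 3.328125
print(decode_fixed_point("11111111", 6))  # maximum value -> 3.984375
```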
4. ALU Port Design for Fractional Value Operations
If we adopt the IEEE 754 single precision data representation format, then we can settle on
the input and output interface of the ALU. IEEE 754 single precision uses 32-bits to
represent a number and so the input and output must have 32 signal lines.
We will introduce a new term here: bus. A bus is a set of signal lines that connects
two components in a computer. Each input is therefore a 32-bit bus. Similarly, the output
data size should match the input data size, and so the output is also a 32-bit bus.
If the ALU supports both integer and floating-point arithmetic, then the same input buses can
be used for both types of operations. In the case of integer operations, the input data is of 32-
bit 2's complement binary representation.
The following diagram shows our design of the ALU with details in the input and output
ports.
In real life, some ALUs support only integer operations or only floating point
operations, but not both. Incorporating both types of operations in a single ALU increases
the complexity of the internal design.
In some cases, a computer design includes two ALUs, one for integer operations and another
for floating point operations. Such a computer will include an additional component to
pass instructions and data to the relevant ALU.
5. Summary
We have discussed how to deal with fractional values in our ALU design. With 32-bit input
and output ports, and 2's complement binary representation and IEEE 754 floating point
representation, a large set of data can be handled by the ALU.
This chapter will discuss the design of a basic programmable computer. We will use our 2's
complement binary representation ALU to be the core of our first programmable computer.
• Memory. This component allows the storage and retrieval of data. The ALU obtains
instructions and data from the memory. It interfaces with the ALU through the Memory
Management Unit (MMU).
• Controller. This component coordinates and controls the operations of other components
including the ALU and the registers. The controller, control unit, or micro-controller
can be regarded as the commander of the computer components, directing everything to
work together.
• System bus. The system bus allows data exchange between the ALU and the registers.
• Clock. The clock provides a signal to allow various components to work in
synchronization.
The programmable computer operates in cycles that involve the following tasks.
• Read the next instruction from the Memory.
• Store the instruction in one of the Registers.
• The Controller examines the instruction and issues a series of commands to the
components. The commands usually involve asking the ALU to calculate or
moving data between the registers.
The tasks are repeated indefinitely. The last task is the key to a programmable computer:
the computer performs different actions according to the instructions. The Controller is
capable of recognizing different instructions and directing the components to perform the
required tasks.
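The repeated cycle can be sketched as a toy interpreter. The opcodes (LOADI, ADDI, HALT) and their encodings below are illustrative assumptions of ours, not the instruction set defined later in this chapter:

```python
LOADI, ADDI, HALT = 1, 2, 0  # hypothetical opcodes

def run(memory):
    pc = 0   # index of the next instruction to read
    acc = 0  # the accumulator register
    while True:
        opcode, operand = memory[pc]  # read the next instruction from Memory
        pc += 1                       # move on to the following instruction
        if opcode == LOADI:           # Controller examines the instruction
            acc = operand             # ...and moves data into a register
        elif opcode == ADDI:
            acc = acc + operand       # ...or commands the ALU to calculate
        elif opcode == HALT:
            return acc

program = [(LOADI, 40), (ADDI, 2), (HALT, 0)]
print(run(program))  # -> 42
```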
This design of a programmable computer includes many important features. The following
sections will discuss each of these features in detail.
2. The Clock and Synchronization of Operations
The ALU receives data from its two input ports, and performs an arithmetic or logic
operation, and delivers the result to its output port.
• For the ALU to operate correctly, the data must arrive at the two input ports in time for
the operation to happen.
• If the data has not arrived in time, then the operation result will be incorrect.
Our solution is based on synchronization of the operations by a clock. Synchronization is a
general technique for making things happen at scheduled times.
• The scheduled times are determined by a clock signal and they are the moments when
operations will occur.
• The ALU would require that input data must arrive at the ports on or before these
moments of operations.
• The clock provides a signal for the ALU to synchronize its operations.
• The rising edges (or the falling edges) of the signal are used as the moments of
operations.
• Other components that provide data to the ALU also use the clock as a reference for their
operations.
The clock signal has helped to ensure that the input data are ready. There are still two more
challenges to the correctness of ALU operations.
• The precise operation of other components so that the input data are actually ready.
• The output data may be lost if the handling component is not ready to receive them.
Our solution is to introduce small pieces of memory called registers to buffer the input and
output.
• A buffer is a data store that can be used to store and retrieve data.
• Registers are a very fast type of buffer or memory.
• Input data that arrives earlier than the scheduled operation moment can be stored in a
register to wait for the operation moment. This reduces the need for the data-providing
components to meet the timing precisely.
• Output data is stored in a buffer that keeps the data for a short while (between two
operation moments), allowing the data-receiving component to read the data.
The following figure shows the ALU input and output ports connected to registers.
• The register connected to the output port is considered important because it contains
the results of operations. It is named the accumulator (ACC).
• The other registers are usually labelled with an index or a number starting from 0. The
two registers at the input ports are called R0 and R1.
• The operation of the registers is also under the control of signals.
• Each of the registers can hold 32 bits of data, which is consistent with the bus width of
the ALU.
4. Register for Storing Instructions
A programmable computer supports the execution of programs, which contains a number of
instructions carefully crafted together. A programmable computer executes a program by
executing the individual instructions of the program, one instruction after another.
The Controller is the component responsible for actually carrying out the tasks to complete an
instruction. These tasks can involve one of the following.
• Command the ALU to carry out different arithmetic and logic operations.
• Read data from the memory or write data to the memory.
• Move data between registers. For example, moving the data from the ACC to an input
port of the ALU.
The Instruction Register (IR) is a component for storing the current instruction. This register
is no different from other registers, except that it is the place where the Controller looks
for its current job. After the next instruction is loaded from the memory, it is stored in
the Instruction Register.
The Form of Instructions
In our programmable computer, the instructions are in the form of binary numbers.
• The memory system has to store data in binary number format. The instructions are
stored also in the memory system. So the instruction form must be consistent with the
data form.
• A programmable computer designer needs to designate binary numbers to represent
different instructions. For example, if our computer supports five instructions (add,
subtract, negate, load data, and store data), we need to decide which binary
representation corresponds to the add instruction and which corresponds to the
subtract instruction.
Assume that the designer has decided to use 32-bit binary numbers for representing
instructions. The following table shows a possible mapping for the five instructions.
The 32-bit binary representations above are known as operation code or opcode.
• The computer designer is free to designate different numbers for different instructions.
• The actual designation however has an impact on the efficiency of instruction execution.
• The instruction register should be large enough to hold every instruction.
• The more sophisticated programmable computer may support a larger number of
instructions. For each instruction, computer designers would designate an opcode.
• The whole set of instructions supported by a computer is called an instruction set.
• The richness of the instruction set of a computer determines its programmability. With
more variety in the instruction set, we can build programs to perform a greater range of
tasks.
5. Memory System for Data and Instruction Storage
The Memory System is essential for storing data and instructions, including the results of
program execution. The ACC can hold only 32 bits, which is not sufficient for the task.
The following are the features of the memory system that was discussed previously.
• Each unit of memory is 32-bit. This size should be consistent with the size of other
components.
• Each memory unit has a unique address, which is also a number.
• The interface to the memory system consists of a data port, an address port, and control
signals. The data port is for the transfer of data. The address port is for specifying the
address. An example of control signals is to control whether the memory operation is
read or write.
• The memory system operation is synchronized with a clock signal. For the memory
system to operate correctly, the timing of the data arriving at the ports must be precise.
The data port and the address port are connected to other components of the programmable
computers.
To improve the operational resilience of the memory system, buffers, in the form of registers,
are added to the address port and the data port. They are called the Memory Address Register
(MAR) and the Memory Data Register (MDR).
• MAR will hold the address of the current memory operation.
• MDR will hold the data for the operation.
The problem now is to connect the Memory System to the ALU components. The following
figure shows how it is done.
A key feature of the design is a data bus that connects all the registers: IR, ACC, R0, MAR,
and MDR.
• The data bus allows data to be moved from the ACC to the MDR and then to the memory.
• The data bus allows data from the memory to be moved to the MDR and then to the ACC
for the next operations.
• These two data movement routes allow data to be moved between the calculation centre
of the programmable computer and the data storage centre.
• The MAR is connected to the ACC, allowing the address of memory operations to be
controlled by the results of operations.
• The MAR is connected to the IR, allowing the address of memory operations to be
controlled directly by instructions.
The Controller plays an important role in signalling the various components to operate
meaningfully. The Controller works according to the instruction that is read into the IR. If
the instruction is add, then the Control Unit sends appropriate signals to the ALU and other
registers to perform an add operation.
Instructions for Interacting with the Memory System
Our programmable computer is getting into shape. For the computer to operate with the
memory system, we need instructions for moving data between the Registers and the
Memory.
Two instructions are designed for the task: load and store. The load instruction moves data
from an address location in the memory to the ACC. The store instruction moves the data in
the ACC to an address location in the memory.
Instruction 32-bit Binary Representation
Add 0000 0000 0000 0001 0000 0000 0000 0000
Subtract 0000 0000 0000 0010 0000 0000 0000 0000
Negate 0000 0000 0000 0011 0000 0000 0000 0000
Load 0000 0000 0000 0100 <16-bit memory address operand>
Store 0000 0000 0000 0101 <16-bit memory address operand>
The load and store instructions include a parameter that specifies the memory address to load
or to store.
• The parameter, or operand, is part of the instruction, taking up the last 16 bits of the
32-bit binary representation.
• The instruction representation is designed this way to save space.
• These two 32-bit instructions contain both the opcode and the operand.
The following figure illustrates the steps to execute a load instruction.
We begin with the load instruction already loaded into the IR.
• The instruction stored in the IR contains an opcode and an address operand.
• The opcode part is checked by the Control Unit and understood to be a load instruction.
• The Control Unit signals the IR to send the address operand to the MAR, and sets the
R/W line of the memory system to Read.
• The Memory system carries out the operation of reading the data from the prescribed
address. The data is sent to the MDR.
• The Control Unit signals the MDR to move the data to the ACC.
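The steps above can be sketched as a toy trace in Python. The register names follow the text; the memory contents and the instruction encoding here are illustrative assumptions of ours:

```python
# Memory modelled as a list indexed by address; registers as plain variables.
memory = [0] * 16
memory[5] = 1234                 # some data already stored at address 5

IR = ("LOAD", 5)                 # opcode part + address operand, already fetched
opcode, operand = IR             # Control Unit checks the opcode part

if opcode == "LOAD":
    MAR = operand                # IR sends the address operand to the MAR
    MDR = memory[MAR]            # memory read: the data arrives at the MDR
    ACC = MDR                    # Control Unit moves the data to the ACC

print(ACC)  # -> 1234
```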
6. Von Neumann Architecture
We will return to the problem of the form of programs, which we have not yet addressed. The
instructions in a program must be codified before they can be passed to the IR and the ALU.
The original form of the program, however, can vary quite a lot. The following shows some
early examples:
• Hardwired. Not programmable.
• Punched film stock (Zuse Z3 in 1941).
• Rewiring to achieve partial programmability (Colossus in 1943 and ENIAC in 1944).
• Punched paper tape (Harvard Mark I / IBM ASCC in 1944).
• Function table ROM (ENIAC in 1948).
The above designs show certain characteristics of the approach to handling programs in
computers.
• Programs should be changeable so as to alter computer operations.
• Programs should be able to be stored away for later and repeated use.
• Programs should be readily accessible to the processor of the computer. The speed of
reading the programs should be fast.
The Von Neumann architecture specifies that programs and data are stored together in
the Memory System. It allows the flexibility to re-programme a computer by
manipulating the program stored in memory.
• Programs can be easily changed through modifying the memory electronically.
• Programs are readily accessible by electronic signals.
• Programs may be stored indefinitely in memory.
• Stored programs can modify themselves in operation, because program code is simply
data in memory cells.
• Stored programs make the likes of compilers and interpreters possible. The purpose of
these programs is to produce other programs.
Program code is now stored in memory and the system bus supports a data movement route
to move instructions from memory to the IR. The Control Unit can take the following steps
to move an instruction to the IR.
• The Control Unit sends the address containing the next instruction to the MAR.
• The Memory system retrieves the data at that address and sends the data to the MDR.
• The Control Unit moves the instruction from the MDR to the IR.
7. System Bus for Connecting the Registers
The registers in the programmable computers are all connected together with the system bus.
• The system bus is the most important highway for data movement in the computer.
• The system bus allows a pair of the registers to move data between them.
• At any one moment, only one such data movement route can operate.
• This is a limitation of the system bus design.
• Although a system bus can connect many registers, only two of them can exchange data
at any one time.
Data movement can be sped up by having movements happen in parallel. The system bus can
be replaced by a fully connected network, in which each pair of registers has a dedicated
highway. There are a few drawbacks to this approach:
• The fully connected network is clearly a lot more costly to build.
• A register can only handle one data movement at a time, even if it is connected to all
other registers.
• Some connections have no use in the operations of the computer.
The performance of the system bus is an important factor in the working of the programmable
computer.
• The system bus operates according to the clock and signal from the Controller.
• The Controller determines which pair of components are to establish connection and to
move data.
8. Input and Output Controllers
The design of our basic programmable computer is completed by adding the input and output
controllers. These controllers are connected to the system bus.
The components that are connected to the input and output controllers are considered peripherals.
• Input devices such as keyboard and mouse, and output devices such as monitor are
common.
• Some IO devices can operate both as input and output devices.
• A hard disk is a memory device that can do both input and output.
The programmable computer has two levels of memory. The memory system that connects
directly to the system bus through the MAR and MDR is called the main memory or
primary memory. The main memory stores data and program for program execution. The
memory system that connects through IO controllers is called secondary memory. Secondary
memory is usually designed for long-term storage.
9. Central Processing Unit (CPU)
For ease of design, implementation, and production, some components of the programmable
computer are integrated closely to form a single component called the Central Processing
Unit (CPU).
The following figure shows that the CPU includes the ALU, the main registers, the system
bus and the controller.
Components of a computer system operate at different speeds. Some are designed that way
and others are constrained by external factors.
• ALU speed is decided by its designed clock speed.
• Main memory speed is decided by physical characteristics of the memory system.
• Secondary memory speed is also decided by the design, and its data transfer speed
(operating speed) is slower than that of the main memory.
• Input speed is partly decided by the input device and partly decided by the user.
• The output device determines the output speed.
• The system bus decides data transfer speed between components.
Coordination is required to allow two or more components to communicate and work
together. There are rules that the components follow when they communicate, and this
often involves one component waiting while other components are doing their work.
Appendix A. Amdahl's Law
In this appendix, we will discuss how to estimate the performance or speed up of a computer
system from its components with Amdahl's Law.
The overall speed of a computer system is limited by its slowest component in the chain of
operation.
• For example, consider a piece of data to be moved from the secondary memory to the
primary memory and then processed by the CPU.
• This chain of operation involves components with vastly different operating speeds.
• The secondary memory is the slowest, and therefore the speed of this operation is limited
by the speed of the secondary memory.
A computer system will carry out many operations.
• Moving data from the secondary memory and executing it on the CPU is only one of them.
• Another operation may simply move data from the primary memory to the CPU and
execute it there.
If a computer system has only these two operations, then the overall speed of the computer
system depends on the frequencies of the two operations and their individual speeds.
Clearly if one operation is very slow, for example the first operation that involves the
secondary memory, the overall speed is effectively determined by the first operation.
Many people hope for a computer speed-up by replacing a slower component with a faster one. The CPU is often the target of a computer upgrade, such as replacing it with a CPU of a faster clock rate. Is replacing the CPU an effective method to speed up a computer system?
Speedup = Old_execution_time / New_execution_time                (1)

Speedup_overall = 1 / ((1 − f) + f / s)                          (2)

Maximum_speedup = 1 / (1 − f)                                    (3)

where f is the fraction of execution time affected by the enhancement and s is the speedup of that fraction.
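To make the formulas concrete, here is a small numerical sketch in Python (the function and parameter names are my own):

```python
def speedup(f: float, s: float) -> float:
    """Amdahl's Law: overall speedup when a fraction f of the
    execution time is improved by a factor of s (equation 2)."""
    return 1.0 / ((1.0 - f) + f / s)

# Replacing the CPU with one twice as fast, when the CPU accounts
# for 40% of the total execution time:
print(speedup(0.4, 2.0))        # -> 1.25
# Equation (3): even an infinitely fast CPU cannot do better than 1 / (1 - f)
print(1.0 / (1.0 - 0.4))        # the upper bound, about 1.67
```

The small overall gain (1.25x from a 2x faster CPU) is why upgrading a single component is often less effective than expected.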
This chapter will discuss the design and evaluation of a programmable computer through
implementation of an emulator.
We have so far come up with the following programmable computer design.
In this chapter we will construct an operational computer based on the above design.
The computer is called the Little Man Computer (LMC). There are several significant deviations from the previous design to allow us to look into other issues. LMC comes with an instruction set, allowing the writing of LMC programs.
The following figure shows the design of the LMC.
LMC is a common teaching tool used by many universities in teaching computer architecture.
The original LMC is usually described in a fictional style. This is not the approach adopted by this course; however, the following describes the LMC in the original manner for your interest.
The following lists the major components of the LMC, and the corresponding components in
a real computer.
The Little Man Computer | The real computer | Role
Calculator | Arithmetic and logical unit | Performs arithmetic and logical operations such as addition and subtraction
Little man | Control unit | Controls the steps (when and where) to load the data from memory into the arithmetic and logical unit (the calculator in the Little Man Computer)
Mailboxes | Memory | Used to store instructions and data. Each mailbox has a label from 00 to 99. Mailbox number 00 is equivalent to memory address 00.
Instruction location counter | Program counter | Used to keep track of which program line is being executed
In-basket | Input controller and buffer | Used to receive data from the outside world into the computer
Out-basket | Output controller | Used to send data from the computer to the outside world
The following figure shows the components in the LMC in an illustrative style.
The Little Man is hidden inside a room where there are a few specific connections to the
outside world only.
Notes about the various components:
• There are 100 mailboxes, each with an address from 00 to 99. Each mailbox address is
therefore represented with two digits. Each mailbox can hold a three-digit decimal
number, which is the content of a mailbox.
• The calculator is available for doing simple arithmetic and storing data temporarily. The
display on the calculator is 3 digits wide.
• The location counter is a hand counter for the little man to keep track of his work. The counter holds a two-digit number (from 00 to 99). The counter has a reset button outside the room, allowing an external instruction to reset the counter.
• Other than the reset button, the only other connections to the outside world are the in-basket and the out-basket.
• There is of course the Little Man who will perform tasks that will be described later.
A user can communicate with the Little Man by placing a 3-digit value in the in-basket; however, it is up to the Little Man to read it at a particular time. The Little Man can also leave a 3-digit value in the out-basket.
No other form of communication with the Little Man is possible.
2. LMC Instruction Set
The instruction set contains instructions that can be used to compose programs. The
instruction set determines the richness and variety of programs that a programmable computer
can support.
There are several issues that need to be considered in the design of the LMC instruction set.
The first instructions to design are the LOAD and STORE instructions that move data
between the Memory and the ACC.
Instruction LOAD
Opcode 5
Instruction Format 5 XX (XX is the address of the data to load into the ACC)
Example 512 is an instruction that causes the data in memory address 12 to be copied to the ACC
• The little man goes to the mailbox and retrieves the value in the specified address.
• The little man then enters the value in the calculator.
• The previous value in the calculator is therefore overwritten.
• The data in the specific address of the mailbox remains the same.
Instruction STORE
Opcode 3
Instruction Format 3 XX (XX is the address at which to store the data in the ACC)
Example 312 is an instruction that causes the data in the ACC to be copied to memory address 12
• The little man goes to the calculator and retrieves the value there.
• The little man then places the value in the mailbox at the specified address.
• The previous value in the mailbox is therefore overwritten.
• The data in the ACC remains intact.
These two instructions allow LMC programs to carry out arithmetic operations.
Instruction ADD
Opcode 1
Instruction Format 1 XX (XX is the address containing the second operand)
Result ACC will store the sum of the ACC and the data in memory address XX
Example 120 is an instruction to sum the ACC and the data in address 20 and the result is
stored in ACC
• The little man goes to the mailbox and retrieves the value in the specified address.
• The little man then adds the value to the value already stored in the calculator. The result
of the addition is stored in the calculator.
Instruction SUBTRACT
Opcode 2
Instruction Format 2 XX (XX is the address containing the second operand)
Result ACC will store the difference between ACC and the data in memory address XX
Example 220 is an instruction to subtract the data in address 20 from the ACC and the result is
stored in ACC
• The little man goes to the mailbox and retrieves the value in the specified address.
• The little man then subtracts the mailbox value from the value already stored in the
calculator. The result of the subtraction is stored in the calculator.
• It is possible to end up with a negative value in the calculator. Negative values are
allowed in the calculator, but not in any other components in the LMC.
These two instructions allow LMC programs to input and output data.
Instruction IN
Opcode 901
Instruction Format 901 (this instruction is an exception because it has no operand)
Result The data in the input buffer is copied to the ACC
• The little man goes to the in-basket and picks up a value there.
• The little man then moves to the calculator and enters the value.
• The previous value in the calculator is therefore overwritten.
• It is possible to have multiple values left in the in-basket. These values are picked up by the little man on a first-come-first-served basis.
Instruction OUT
Opcode 902
Instruction Format 902 (this instruction is an exception because it has no operand)
Result The data in the ACC is sent to the output
• The little man goes to the calculator and retrieves the value there.
• The little man goes to the out-basket and places the calculator value there.
• The value in the calculator remains there.
• It is possible to have multiple values placed in the out-basket. These values preserve their original order when the user receives them.
These two instructions have the same effect. They stop the LMC computer.
BRANCH instructions
So far, LMC programs must be executed in sequence. The instruction location counter (PC) always increases by one after the completion of an instruction. The following branch instructions, however, allow the instruction location counter to be changed, so that the next instruction to execute can be at another address.
Unconditional branch instructions force the execution to move to another address. Conditional branch instructions move the execution to another address only if a certain condition is met.
• The little man goes to the instruction location counter and stores the address part of the instruction there.
• After the completion of this instruction, the next instruction will be retrieved from the address stored in the instruction location counter. The Little Man expects a valid instruction to be stored at that address.
• For example, the instruction 6 2 3 means that the value 23 is stored in the instruction location counter. The next instruction to execute is the content stored in mailbox address 23.
• The little man goes to the calculator and checks the value stored in the calculator to see if it is zero.
• If the value is zero, then the little man goes to the instruction location counter and stores the address part of the instruction there.
• If the value is not zero, then to the little man the instruction is completed.
• The little man goes to the calculator and checks the value stored in the calculator.
• If the value is zero or positive, then the little man goes to the instruction location counter and stores the address part of the instruction there.
• If the value is negative, then to the little man the instruction is completed.
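The instruction set described above can be exercised with a small emulator, which is what this chapter set out to build. Below is a minimal sketch in Python. The opcodes for ADD (1), SUB (2), STO (3), LDA (5), BR (6), IN (901) and OUT (902) come from the text; BRZ (7), BRP (8) and HLT (000) are assumed to follow the conventional LMC encoding, since their numeric opcodes are not listed above.

```python
def run_lmc(program, inputs):
    """Minimal LMC interpreter: 100 mailboxes, 3-digit instructions."""
    mem = list(program) + [0] * (100 - len(program))
    pc, acc = 0, 0                       # location counter and calculator
    inputs, outputs = list(inputs), []
    while True:
        instr = mem[pc]
        pc += 1                          # counter advances after each fetch
        opcode, addr = divmod(instr, 100)
        if opcode == 0:                  # HLT (assumed opcode 0xx)
            return outputs
        elif instr == 901:               # IN: next value from the in-basket
            acc = inputs.pop(0)
        elif instr == 902:               # OUT: place ACC in the out-basket
            outputs.append(acc)
        elif opcode == 1:                # ADD
            acc += mem[addr]
        elif opcode == 2:                # SUB (ACC may become negative)
            acc -= mem[addr]
        elif opcode == 3:                # STO
            mem[addr] = acc
        elif opcode == 5:                # LDA
            acc = mem[addr]
        elif opcode == 6:                # BR: unconditional branch
            pc = addr
        elif opcode == 7 and acc == 0:   # BRZ (assumed opcode 7)
            pc = addr
        elif opcode == 8 and acc >= 0:   # BRP (assumed opcode 8)
            pc = addr

# Add two inputs: IN, STO 99, IN, ADD 99, OUT, HLT
print(run_lmc([901, 399, 901, 199, 902, 0], [3, 4]))   # -> [7]
```

The emulator mirrors the Little Man's behaviour: the location counter is incremented right after each fetch, so a branch simply overwrites it.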
3. Example LMC Programs
Comparing Two Numbers
The following compares two numbers and prints the larger number.
Program Loader
A program loader is an important component of an operating system. If we were to develop an operating system for the LMC, this would be the first program needed.
4. Program the LMC
To program the LMC with mnemonics, a programmer can follow the steps below:
• Write the LMC program using mnemonics in an editor.
• Use a program known as an assembler to assemble the 3-digit instructions from
mnemonics. The assembler will ensure that the first instruction is at address zero.
• The programmer presses the Reset button to start. The Reset button sets the Program Counter to zero, and this is where execution begins.
Modern-day computer programmers seldom use assembly language to write programs. High-level programming languages such as C and Java are used instead. Programmers use compilers and linkers to convert a program written in a high-level language into instructions that can be executed by the CPU.
Conversion of a Selection Statement
The following shows a simple program written in C.
In LMC
00 IN ; Input data
01 STO 99 ; Store in 99
02 SUB 98
03 BRP 06 ; Branch if x>=5
04 LDA 97 ; Load '1'
05 BR 08
06 BRZ 04 ; Branch if x==5, output '1'
07 LDA 96 ; Load '0'
08 OUT ; Print
09 HLT;
96 DAT 00 ; Constant 0
97 DAT 01 ; Constant 1
98 DAT 05 ; Constant 5
99 DAT
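The original C listing is not reproduced here, but tracing the LMC code above reveals the logic: output 1 when the input is less than or equal to 5, and 0 otherwise. A sketch of the equivalent logic (in Python rather than C, for brevity):

```python
def classify(x: int) -> int:
    """Mirror of the LMC listing: SUB 98 computes x - 5, BRP branches
    when the result is non-negative, and the BRZ at address 06 routes
    the x == 5 case back to loading the constant 1."""
    return 1 if x <= 5 else 0

print([classify(x) for x in (3, 5, 7)])   # -> [1, 1, 0]
```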
In LMC
00 LDA 99
01 SUB 98 ; Calculate X - 10
02 BRZ 08 ; Jump to after the loop
03 LDA 99 ; Load X
04 OUT ; Print X
05 ADD 97
06 STO 99 ; X = X + 1
07 BR 00 ; Jump to the top of loop
08 HLT
97 DAT 01 ; Constant 1
98 DAT 10 ; Constant 10
99 DAT 0 ; Variable X
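The second listing is a counting loop: it prints X starting from 0 and stops when X reaches 10. The equivalent logic, sketched in Python:

```python
def count_up() -> list:
    """Mirror of the LMC loop: LDA 99 / SUB 98 computes X - 10, and
    BRZ exits the loop when the result is zero."""
    outputs = []
    x = 0                   # mailbox 99, the variable X
    while x - 10 != 0:      # SUB 98 / BRZ 08
        outputs.append(x)   # OUT
        x += 1              # ADD 97 / STO 99
    return outputs

print(count_up())   # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```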
5. Benefits and Hazards of Von Neumann Architecture
In a stored-program architecture computer such as the LMC, both data and program exist in the memory system.
This design has a number of benefits:
• Simpler computer design. A single memory system is needed for both data and program instructions. Otherwise, separate memory systems would be needed, and each would need its own input/output devices.
• Allows a programmer to write instructions that modify or create other instructions. This
could reduce program size and improve programmability. The operating system example
for LMC below would not be possible without self-modifying instructions.
However, this can cause a problem.
• There is no indicator in an individual address to signify whether it is an instruction or
data.
• The LMC has no way to predict whether an individual address contains an instruction or data. The LMC can only assume it to be a valid instruction.
• The LMC can find out that an address contains an invalid instruction only after reading it in. For example, 903 is not a valid instruction. The LMC cannot execute it, and this will possibly cause a system crash or exception.
• It is, however, possible that a piece of data happens to have the same value as a valid instruction, causing the LMC to perform unpredictable actions.
• Only the programmer has this knowledge and so it is up to the programmer to take care.
When a programmer writes programs using mnemonics, one can use the DAT mnemonic to store any constant in an address. The DAT signifies that the programmer does not expect this mailbox to be fetched and executed by the Little Man. However, the LMC can still execute the data if the data is a valid instruction.
• (1 Check) The Control Unit checks the IR and then act according to the instruction.
• (2 Addr) The Control Unit moves the address part of the instruction from IR to MAR
• (3 Data) The Memory System retrieves the required data and sends it to the MDR
• (4 Data) The Control Unit moves the data from the MDR to the ACC
The execution of a LMC instruction involves a number of operations. These operations are
called micro-operations. Many of these micro-operations involve moving data between
components, especially the registers. This is a data movement perspective of computer
operation.
The operation of the programmable computer essentially boils down to moving data between components. The computer will execute instructions faster if the data movement is faster.
faster. The following lists the common data movement patterns involved in LMC instruction
execution:
• PC to MAR (the address of next instruction)
• Memory System to MDR (memory read data)
• MDR to Memory System (memory write data)
• MDR to IR (the current instruction)
• MDR to ACC (data)
• ACC to MDR (data)
• MDR to MAR (memory address)
The Control Unit (CU) is the coordinator of computer operations. It sends signals to various
components so that micro-operations can be carried out meaningfully. The following
illustrates the step-by-step actions of the CU taken to execute a LDA instruction.
• Control Unit moves the address in the PC to MAR
• Memory system retrieves the data of the address and sends the data to the MDR
• Control Unit moves the instruction from MDR to IR. The instruction stored in the IR
contains an opcode and an address operand.
• Control Unit signals the PC to increase by one
• The opcode part is checked by the Control Unit and understood to be a load instruction.
• The Control Unit signals the IR to send the address operand to the MAR, and the Control
Unit sets the R/W line of the memory system to Read
• The Memory system carries out the operation of reading a data from the prescribed
address. The data is sent to the MDR.
• The Control Unit signals the MDR to move the data to the ACC.
The first four steps are common to all instructions and the last steps are specific to the
instruction being executed. After the last step, the computer operation returns to the first step
to execute the next instruction. Computer operations are carried out in an unceasing cycle.
7. Fetch and Execution Cycle
Computer operations are carried out in an unceasing cycle of micro-operations execution that
is coordinated by the Control Unit.
This cycle is known as the fetch and execution cycle.
• This cycle repeats indefinitely until the computer is halted.
• The first part is known as the fetch part. The purpose is to fetch the next instruction into
the IR. The specific micro-operations of this part are always the same.
• The next part is known as the execution part, which carries out different actions
according to the specific instruction being executed.
The following figure illustrates graphically the fetch and execution cycle.
In the fetch and execution cycle, each step in the cycle is a micro-operation.
The rudimentary micro-operations performed by the CPU include the following:
• Invoke a function on the memory system
o Fetch data from a specific memory location
o Store data to a specific memory location
• Invoke a function on the program counter (add one)
• Invoke a function on the ALU
o Carry out an arithmetic or logic operation
• Transfer data from one register to another register.
8. Register Transfer Language (RTL)
The Register Transfer Language (RTL) provides a concise language for us to describe
micro-operations in our programmable computer.
• It details the data movement routes and patterns.
• It makes the number of steps and the time taken to execute an instruction more visible.
• It explains the steps in the fetch-execute instruction cycle.
The RTL uses different notations for different meanings:
• Capitalized names are registers or other components in the programmable computer. For
examples, ACC, R0, IR, and PC.
• Square brackets [ ] indicate one part of a register or the content of a memory address. For example, IR[address] means the address part of the IR.
• The equals sign = indicates that the content of the memory address or register is a certain
value. For example, M[5] = 3 means the content of memory location 5 is now assigned
the value 3.
• The arrow -> indicates movement of data. For example, M[1] + M[2] -> M[R1] means the contents of memory locations 1 and 2 are added together and put into the memory location specified by R1.
Example: RTL
PC -> MAR Send instruction address to MAR
M[MAR] -> MDR Read the current instruction
MDR -> IR Copy the instruction to the IR
PC + 1 -> PC Point to the next instruction
Example 2: ADD Instruction
The following RTL describes steps of executing an instruction of ADD A, ACC, which adds
the data of a memory address A to ACC. The result is stored in ACC.
Example: RTL
PC -> MAR Send instruction address to MAR
M[MAR] -> MDR Read the current instruction
MDR -> IR Copy the instruction to the IR
PC + 1 -> PC Point to the next instruction
IR[address] -> MAR Send the operand address A to MAR
M[MAR] -> MDR Read the data of address A
ACC + MDR -> ACC Add the data to ACC in the ALU
The above assumes that A is part of the instruction and so it is available in the IR.
• The first four lines are the fetch phase.
• The Control Unit then copies the address operand in the IR and puts it in the MAR.
• The Memory System then sends the data of address A to MDR.
• Finally, the addition of MDR to ACC is carried out by the ALU.
Example: RTL
Question: Write down the RTL for an instruction ADD R0, 4. The instruction adds the
content of Memory Address 4 to R0.
Assuming that the address of the instruction ADD R0, 4 is 2000, the data in address 4 is 10,
the data in R0 is 30. The content of the major registers after the execution of the instruction
are shown below.
Register Data/Content
PC 2001
IR Holding the instruction code of ADD R0, 4
R0 40
MAR 4
MDR 10
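The register contents in the table can be verified by stepping through the micro-operations; here is a sketch in Python (the dictionary-based memory and field names are my own illustration):

```python
def execute_add_r0(mem, pc, r0):
    """Step through the RTL for ADD R0, 4 and return the register
    contents afterwards. The micro-operations mirror the fetch phase
    plus the execute phase described above."""
    mar = pc                 # PC -> MAR
    mdr = mem[mar]           # M[MAR] -> MDR  (read the instruction)
    ir = mdr                 # MDR -> IR
    pc = pc + 1              # PC + 1 -> PC
    mar = ir["address"]      # IR[address] -> MAR
    mdr = mem[mar]           # M[MAR] -> MDR  (read the operand)
    r0 = r0 + mdr            # R0 + MDR -> R0 (done by the ALU)
    return {"PC": pc, "R0": r0, "MAR": mar, "MDR": mdr}

mem = {2000: {"opcode": "ADD", "address": 4}, 4: 10}
print(execute_add_r0(mem, pc=2000, r0=30))
# -> {'PC': 2001, 'R0': 40, 'MAR': 4, 'MDR': 10}
```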
Exercise: LMC program
Question: Given the following LMC program.
Assume that the PC is 03, write down the steps in executing the instruction STO 11 with
RTL.
Answer:
PC -> MAR
M[MAR] -> MDR
MDR -> IR
IR[ADDRESS] -> MAR
ACC -> MDR
MDR -> M[MAR]
PC + 1 -> PC
Question: With the same LMC program, write down the steps in executing the instruction
BR 00 at address 06 with RTL.
Answer:
PC -> MAR
M[MAR] -> MDR
MDR -> IR
IR[ADDRESS] -> PC
Question: The LMC is executing the instruction BR 00 at address 06. Write down the
content of MAR, MDR, IR, and PC after the execution.
Answer:
We should look at the RTL and know what values have been loaded to these registers.
MAR = 06
MDR = 600
IR = 600
PC = 00
Benefits of RTL
Studying the Register Transfer Language enables us to understand the effort involved in executing an instruction. The RTL reveals that some instructions require more effort than others. Generally, RTL can help us determine the following.
• Normally a clock controls the execution of the steps in RTL, so each step can be completed in one clock cycle. RTL allows us to easily estimate the clock cycles per instruction.
• With RTL, designers can investigate whether any steps can be carried out in parallel. Depending on the architecture of the CPU design, some CPUs support more than one bus between components, and so some steps may be carried out in parallel. In this case, the number of clock cycles for some instructions can be reduced.
• The RTL can specify the execution of every instruction in a procedural manner, which
can be used in the implementation of the control unit of the CPU. The control unit of the
CPU is responsible for coordinating the data movement and the components, and it may
be implemented as a programmable micro-controller in RTL.
If one step needs one clock cycle to complete, then the above examples show that some instructions, such as STO, would take 7 steps (cycles), while other instructions, such as BR, would take 4 steps (cycles).
9. LMC Performance Analysis
The following table summarizes the theoretical number of cycles required for every LMC instruction.
The theoretical figures are worked out based on the following assumptions:
• Each data movement between registers takes one cycle.
• Each memory system operation takes one cycle.
• Each input and output operation takes one cycle.
If the number of execution cycles of each LMC instruction is known, then the speed of LMC
program execution can be easily calculated.
Question: You have written a LMC program. In one execution of the program, you
counted the number of instructions executed: there are 300 LDA or STO instructions, 120
ADD or SUB instructions, 20 BR, BRZ, or BRP instructions, and 5 IN or OUT
instructions. If the CPU clock rate is 100 MHz, calculate the time taken to execute the
program.
Answer:
Total number of cycles is calculated from the summation of number of cycles for each
instruction.
= 300 x 7 cycles + 120 x 7 cycles + 20 x 4 cycles + 5 x 5 cycles
= 3045 cycles
The clock rate is 100 MHz, which means 100 M cycles per second
The time taken to execute the program is 3045 / 100 M = 0.00003045 seconds.
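The calculation above can be sketched as a small function (the cycle counts follow the RTL step counts derived earlier; the names are my own):

```python
# Cycles per instruction class, from the RTL step counts
CYCLES = {"LDA/STO": 7, "ADD/SUB": 7, "BR/BRZ/BRP": 4, "IN/OUT": 5}

def execution_time(mix: dict, clock_hz: float):
    """Total cycles weighted by the instruction mix, then divided
    by the clock rate to give seconds."""
    total = sum(CYCLES[kind] * count for kind, count in mix.items())
    return total, total / clock_hz

cycles, seconds = execution_time(
    {"LDA/STO": 300, "ADD/SUB": 120, "BR/BRZ/BRP": 20, "IN/OUT": 5},
    clock_hz=100e6)
print(cycles, seconds)   # -> 3045 3.045e-05
```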
The running time of a program not only depends on the total number of instructions executed,
it also depends on the composition of the instructions.
Exercise: Execution Speed of LMC Programs
Question: Anders and Betsy have each written an LMC program to find the square of a number.
Anders' program execution involved 40 instructions, including 15 ADD/SUB, 20 LDA/STO, 3 BR/BRZ/BRP, and 2 IN/OUT.
Betsy's program execution involved 42 instructions, including 14 ADD/SUB, 18 LDA/STO, 8 BR/BRZ/BRP, and 2 IN/OUT.
Which program is better in terms of execution speed?
Answer:
Total number of cycles is calculated from the summation of number of cycles for each
instruction.
Anders' program
= (15 + 20) x 7 cycles + 3 x 4 cycles + 2 x 5 cycles
= 267 cycles
Betsy's program
= (14 + 18) x 7 cycles + 8 x 4 cycles + 2 x 5 cycles
= 266 cycles
Betsy's program ran faster, even though it executed more instructions in total.
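The same cycle counting settles the comparison programmatically (a sketch; cycle counts as in the table above):

```python
# Cycles per instruction class, from the RTL step counts
CYCLES = {"ADD/SUB": 7, "LDA/STO": 7, "BR/BRZ/BRP": 4, "IN/OUT": 5}

def total_cycles(mix: dict) -> int:
    """Sum the cycles for each instruction class in the mix."""
    return sum(CYCLES[kind] * count for kind, count in mix.items())

anders = total_cycles({"ADD/SUB": 15, "LDA/STO": 20, "BR/BRZ/BRP": 3, "IN/OUT": 2})
betsy = total_cycles({"ADD/SUB": 14, "LDA/STO": 18, "BR/BRZ/BRP": 8, "IN/OUT": 2})
print(anders, betsy)   # -> 267 266
```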
Using the cycles per instruction is a useful tool to evaluate the performance of a program. If
the instructions take fewer cycles to execute, then program execution can be faster. This can
be achieved by improving computer design.
LMC is a useful abstraction of real computers. However, the cycles per instruction in a real computer is not simply the number of RTL steps. Here are the differences:
• The memory system normally takes longer than one CPU cycle to perform a load/store.
• The program counter increment can occur at the same time as another micro-operation, so it does not require a cycle of its own.
This chapter will discuss the technologies for real computer systems. The discussion on the
LMC has concluded with four major components in computer operations.
• CPU or processor: executing instructions
• Bus: data movement for executing instructions
• Memory: data storage and retrieval
• IO: data input and output
This chapter will discuss the technologies developed for these four major components.
The entire manufacturing process takes place in a highly controlled environment in special manufacturing plants. The duration of the process is typically around 2 months. Here are the common steps taken:
• Wafer preparation: crystallisation of pure silicon
• Wafer processing: printing circuitry onto silicon wafer
• Die cutting and attachment: units of CPU die are cut from wafer
• Chip packaging and testing
CPU cost is usually a significant part of the cost of a computer system. The cost of a CPU
chip depends on a few factors, and the most important ones are:
• Maturity of the manufacturing process.
• Size of the CPU chip.
• Raw materials.
• Competition in the market.
The current manufacturing process in 2014 is called the 14-nanometre process (14 nm). The
figure is roughly indicative of the (half) distance between features in the printed circuitry.
One important feature of a mature process is the yield. A very new and immature manufacturing process produces more defects in the wafers and the dies. The overall cost is therefore elevated to cover the losses caused by the defects. The following formula gives an estimate of the cost of a die.
Die_cost = Wafer_cost / (Dies_per_wafer × Die_yield)
In 2009, processing a 300 mm wafer cost around US$2800 but a 150 mm wafer cost less than US$450 (Reference: GSA Wafer Fabrication Pricing Reports). The 300 mm and 150 mm figures are the diameters of a circular wafer. The number of dies per wafer can be estimated using the following formula.
Dies_per_wafer = π × (Wafer_diameter / 2)² / Die_area − π × Wafer_diameter / √(2 × Die_area)
Question: The die size of an Intel Core i7 is 263 mm²; calculate how many dies can be cut from a 300 mm wafer.
Answer:
Dies_per_wafer = π × (300 / 2)² / 263 − π × 300 / √(2 × 263) = 268.7 − 41.1 = 227 dies
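The arithmetic can be checked with a short script (a sketch; the function name is mine):

```python
import math

def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """Gross dies per wafer: wafer area divided by die area, minus an
    edge-loss term for the partial dies around the circumference."""
    area_term = math.pi * (wafer_diameter_mm / 2) ** 2 / die_area_mm2
    edge_term = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(area_term - edge_term)

print(dies_per_wafer(300, 263))   # -> 227
```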
The die yield depends on the wafer yield (how many wafers are defective) and the defects per unit area in the manufacturing process. A typical range is 0.3 to 0.6 for new processes. A larger die makes it more likely that a defect occurs in the area occupied by the die.
2. Bus Technologies
A bus is a data channel for transferring data from one device to one or more other devices.
The system bus, which connects the various registers in the CPU, is an example. There are
many buses in a computer system.
A bus consists of a number of lines, each of which serves one of the following four purposes:
• Data. A data line is binary encoded, and therefore it can carry one bit of data at a time.
• Addressing. An address line is binary encoded, and therefore it can carry one bit of data at a time. The data on the address lines represents an address.
• Control. A control line is also binary encoded. The data on a control line represents a
signal. For example, the control unit (CU) sends a signal to the program counter (PC) on
a control line (to invoke the increment function).
• Power. The power at a particular stable voltage supplied by the computer system.
Although the data sent on a bus is said to be binary encoded, there is usually a lower-level encoding scheme to code the binary values 0 and 1 into another signal representation. For example, USB encodes binary data with the NRZI encoding scheme, which represents 0 and 1 by the presence or absence of a transition between two signal states.
Bus Throughput
Bus throughput is the amount of data transfer on a bus per second. Bus throughput is often
called data rate or bandwidth. For example, USB 2.0 data rate is 480 Mbit per second.
Some buses, such as the front side bus (FSB) on a PC, are rated in terms of frequency. The frequency defines the period required to send one unit of data. The relation between frequency and period is given by the formula:
Frequency (Hz) = 1 / Period (s)
For example, a 500 MHz FSB means that the cycle period is 2 ns.
• A high throughput means moving more data in a particular time frame.
• If one data line in a bus can move a unit of data in a cycle, then theoretically a 32-line
data bus can move 32 units in a cycle.
• Basically there are two ways to achieve high throughput: increasing the transfer rate and
increasing the number of lines.
The above diagram assumes that one data line can transfer 1 bit of data per cycle.
The following diagram illustrates the benefits of higher data rate and a wider bus.
Question: Given that each data line can complete the transfer of 1 bit in 200 ns, calculate the throughput if the bus has a total of 32 lines (a 32-bit bus).
Answer:
Data rate per line = 1 bit / 200 ns = 5 Mbit/s
Bus throughput = 32 x 5 Mbit/s = 160 Mbit/s = 20 Mbytes/s
Some modern bus systems support multiple data transfers in one clock cycle. For example, AGTL+ allows 4 transfers per cycle.
Question: Given that an AGTL+ is running on a clock rate of 100 MHz, and the bus is 64
bit. Calculate the throughput.
Answer:
Data Rate per Line = 100 M bits / second x 4 transfers = 400 M bits / second
Bus Throughput = 64 x 400 M bits / second = 25.6 G bits / second = 3.2 G bytes / second
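The throughput calculation generalises to clock rate x transfers per cycle x bus width; a sketch:

```python
def bus_throughput_bits(clock_hz: float, transfers_per_cycle: int,
                        width_bits: int) -> float:
    """Throughput in bits/s: per-line data rate times the number of lines."""
    return clock_hz * transfers_per_cycle * width_bits

bits = bus_throughput_bits(100e6, 4, 64)   # AGTL+ quad-pumped, 64-bit
print(bits / 1e9, bits / 8 / 1e9)          # -> 25.6 3.2 (Gbit/s, Gbyte/s)
```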
This facilitates data transfers occurring in parallel and reduces waiting time. However, each register can only handle one update (store operation) at a time.
Recently, serial buses have become the most common form of bus, even for short-distance communication. A serial bus running at a significantly faster clock rate can outperform a parallel bus, and at a cheaper price as well.
The Personal Computer (PC) is a class of desktop computers available at an affordable price for people to use in homes and offices. This section reviews the different types of buses found in generations of the PC.
The PCI bus (Peripheral Component Interconnect) is a standard for connecting a computer to
peripheral devices.
• PCI is a multi-point bus and a parallel bus between the IO controller hub (the southbridge) and PCI devices.
• Configuration: clock rate from 33MHz to 66MHz, with data width 32-bit or 64-bit, giving
throughput from 133 MB/s to 533 MB/s.
• PCI supports plug-and-play, and the device interrupt identifier is assigned by firmware
rather than using jumpers.
• PCI has a variant called PCI-eXtended (PCI-X), which runs at a clock rate of 133MHz, giving a bandwidth of 1066 MB/s.
The PCI Express bus is a standard for connecting a computer to peripheral devices.
• PCI-Express is a point-to-point and a serial bus. Data and signals are transferred on
lanes.
• However, a link between 2 PCIe devices may operate on a different number of lanes, depending on the throughput needed. High-demand applications such as graphics can run on multiple PCIe lanes.
• Data transmission on a multi-lane connection is interleaved, such that successive bytes are transferred on different lanes.
• Data rate is around 250MB/s per lane, and a 16-lane connection is capable of around
4000MB/s.
• First-generation PCIe is constrained to a single signalling rate of 2.5 Gbit/s. The figure of 250 MB/s per lane is calculated from the physical signalling rate (2500 Mbaud) divided by the encoding overhead (10 bits/byte). This means a 16-lane (x16) PCIe card would then be theoretically capable of 250 × 16 = 4000 MB/s (3.7 GiB/s) in each direction.
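The per-lane and x16 figures follow directly from the 8b/10b encoding overhead (a sketch):

```python
# First-generation PCIe: 2.5 Gbaud per lane, 10 signalling bits per data byte
signalling_rate_baud = 2.5e9
bytes_per_sec_per_lane = signalling_rate_baud / 10   # 8b/10b overhead
x16 = 16 * bytes_per_sec_per_lane                    # 16-lane connection
print(bytes_per_sec_per_lane / 1e6, x16 / 1e6)       # -> 250.0 4000.0 (MB/s)
```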
Accelerated Graphics Port (AGP)
AGP is a bus for connecting a video device to the computer. The point of connection is often the primary controller hub to the main memory and the CPU.
• AGP is a point-to-point bus and a parallel bus. It has already been superseded by the PCI-Express bus.
• AGP has a variety of speed and size
o AGP 2x: 32-bit, 66MHz, double-pumped (data transfer)
o AGP 4x: 32-bit, 66MHz, quad-pumped
o AGP 8x: 32-bit, 66MHz, eight transfers per clock cycle
o AGP 2x throughput: 4 bytes x 66MHz x 2 = 533MB/s
The ISA bus (Industry Standard Architecture) is an old standard of computer bus on PC
connecting peripheral devices.
• ISA is a parallel and multi-point bus.
• Originally designed as 8-bit bus (at 4.77MHz), and subsequently upgraded to 16-bit (at
8MHz).
• The EISA improvement extended the bus further to 32 bits (at 8.33MHz) and allowed more than one CPU to connect to the bus.
• ISA supports DMA and an early version of plug-and-play, which did not perform well.
• First-generation Serial ATA interfaces, also known as SATA/150, run at a signalling rate of 1.5 Gbit/s. Because Serial ATA uses 8B/10B encoding with an efficiency of 80% at the physical layer, this results in an actual data transfer rate of 1.2 gigabits per second (Gbit/s), or 150 megabytes per second.
• This transfer rate is only slightly higher than that provided by the fastest "Parallel ATA"
mode, Ultra ATA at 133 MB/s (UDMA/133).
• With the release of the NVIDIA nForce4 chipset in 2004, the signalling rate of SATA II was doubled to 3.0 Gbit/s, for a maximum throughput of 300 MB/s or 2.4 Gbit/s.
USB Bus
The USB bus has become a popular means of peripheral connection.
• 1-bit serial
• Hot-pluggable
• USB devices driven by host computer
• USB 2.0 (60 MB/s) and USB 3.0 (400 MB/s), compared to FireWire 400 (400 Mbit/s) and FireWire 800 (800 Mbit/s).
4. Memory Technologies
The range of memory technologies for computers spans many dimensions: speed, cost, and
other characteristics. While a computer designer could ignore cost and simply choose the
best memory technology, the market favours using the memory type that fits the purpose.
Data stored in a memory system is structured around a basic unit of data. The size of this
unit varies from one type of memory to another: it could be 1-bit, 8-bit, 32-bit, 64-bit, and
so on. This is often referred to as the word size.
Each data unit is uniquely identifiable by its address. The address is an essential parameter
for the load/store operations of a memory system.
• Compactness: the space occupied by memory can be an important consideration when it
is integrated with other computer components. Usually the smaller the better.
• Throughput or data transfer rate: the time taken to transfer an amount of data. Usually
measured in the same way as throughput in buses (Mbytes per second)
• Latency: the time taken for a memory system to begin data transfer. Some memory
systems (such as CDROMs, hard disks or tape-drives) require a setup time after receiving
instructions to perform read/write operations. Other memory systems purposely add
latency in order to achieve a higher throughput (such as DDR RAM).
The following shows a typical memory hierarchy of a desktop computer. There are various
types of memory, each with different characteristics and purposes.
Memory in Processors
• Registers in processors are very fast, compact, and mutable memory. The operating
speed must be fast enough to match the internal processor clock rate. It should be
compact and therefore fit into the physical package of processors. Memory technology
suitable for this purpose is costly and volatile.
• Cache memory in processors is used to mirror part of the main memory. If the required
data or instruction is already in the processor, then access to the main memory through
the front-side bus can be avoided. Cache memory should also be fast, compact, and
mutable. Therefore, cost limits the amount of cache memory that can be included.
Primary Memory
• Primary memory is the memory system that is directly addressable from the processor. In
other words, data in the primary memory can be directly referred with instructions.
• Primary memory is often known as the Main Memory system. The Main Memory should
be large, less costly, and mutable. A large amount of Main Memory is critical for the
execution of programs, especially in multi-programming systems. The Main Memory is
often not physically contained within the processor, so compactness is not a major concern.
• IO Cache and Buffer is sometimes part of the Main Memory. They can make IO
operations more efficient.
Secondary Memory
• Secondary memory provides long-term data and program storage.
• The demand for capacity is higher given the larger total amount of data handled by a
computer system.
• It is often non-volatile and low-cost. The available technologies for secondary memory
are slower, but latency is often not a major concern for this role.
• The media for storage determines whether the secondary memory device is mutable.
The Main Memory
The Main Memory is the memory system that feeds the processor with instructions and data.
It works closely with the processor in the fetch and execution cycle.
• In the fetch phase, processor needs to load the next instruction from the Main Memory.
• In the execution phase, processor may execute an instruction that involves data from the
Main Memory.
The Memory Address Register (MAR) and the Memory Data Register (MDR) form the
interface between the Main Memory and the CPU.
• The MAR specifies the address of the memory required.
• The MDR holds the data for the transaction.
The MAR and the MDR are connected to the Main Memory in the following manner.
• The MAR holds the address in 8 bits or a multiple thereof (depending on the addressable
space). A decoder converts this address into a set of activate lines, of which only one is
activated according to the value of the MAR. The activated line connects to the memory
cells of the address.
• The memory cells on the activated address line are connected to the MDR. Then MDR
can either read values from the memory cells, or write new values to them.
The following lists the three main buses and signal lines involved in the operation of memory.
• There are usually 32, 64, or 128 address lines, corresponding to an address size of 32, 64,
or 128 bits. The number of address lines is exactly the size of the MAR used in the CPU.
One of the activate lines is activated according to the value represented by the address
lines.
• There is usually a R/W line associated with the MAR/MDR to indicate whether this
memory access is a read or a write operation.
• There are also multiple lines connecting the MDR to the cells of each memory address.
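The MAR/MDR interface described above can be modelled with a small Python sketch (the class and its behaviour are illustrative, not a description of any particular hardware):

```python
class MainMemory:
    """Toy model of a memory system driven by the MAR, MDR and R/W line."""
    def __init__(self, size=256):
        self.cells = [0] * size   # one data unit per address
        self.mar = 0              # Memory Address Register: selects the cells
        self.mdr = 0              # Memory Data Register: holds the data

    def cycle(self, rw):
        """One memory operation, triggered once the registers are loaded.
        The decoder activates the line for the address in the MAR; the MDR
        then reads from or writes to the activated cells."""
        if rw == "R":
            self.mdr = self.cells[self.mar]
        elif rw == "W":
            self.cells[self.mar] = self.mdr

mem = MainMemory()
mem.mar, mem.mdr = 10, 42
mem.cycle("W")                # store 42 at address 10
mem.mar = 10
mem.cycle("R")
print(mem.mdr)                # 42
```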
The following figure shows the design of a basic memory system:
The CPU and the MAR/MDR operate in the fetch phase of the fetch-execution cycle in the
following manner.
• The content of the Program Counter Register is copied to the MAR, which is the address
storing the next instruction.
• The R/W line is set to read.
• The content of the given address is stored in the MDR (with previous value overwritten).
The content (instruction) is copied to Instruction Register (IR).
The CPU then examines the instruction stored in the IR to determine the actions to follow. If
the instruction is to store the value of an accumulator to a memory location (similar to the
STO instruction in the LMC), then the following happens in the execution phase of the cycle.
• The address part of the instruction is copied from the IR to the MAR, which is the address
where data is to be stored.
• The R/W line is set to write.
• The data in the accumulator is copied to the MDR. The content of the MDR is stored to
the memory cells activated by the MAR and the decoder.
In modern PC computer systems, the MAR and MDR are part of the Memory Management
Unit (MMU) that also performs other memory related functions.
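The fetch phase and the STO execution phase above can be simulated in a few lines of Python (the tuple instruction encoding and the function name are invented for illustration):

```python
def fetch_and_execute_sto(memory, pc, acc):
    """One fetch-execute round for an LMC-style STO instruction.
    memory is a list of cells; instructions are (opcode, address) tuples."""
    # Fetch phase
    mar = pc              # PC -> MAR
    mdr = memory[mar]     # R/W line set to read; selected cells -> MDR
    ir = mdr              # MDR -> IR
    opcode, address = ir
    # Execution phase (STO: store the accumulator at the given address)
    if opcode == "STO":
        mar = address     # address part of the IR -> MAR
        mdr = acc         # accumulator -> MDR; R/W line set to write
        memory[mar] = mdr # MDR stored into the activated cells

memory = [("STO", 3), 0, 0, 0]
fetch_and_execute_sto(memory, pc=0, acc=99)
print(memory[3])          # 99
```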
Operation of Main Memory Systems
Operation of the main memory system can occur when the MAR, the MDR, and the address
and R/W lines are loaded with data. This loading of data takes time. The operation is
therefore synchronized with a memory clock so that the loading of data and the operation
take place at the correct timings.
An electronic clock in a computer is a signal that alternates between high and low repeatedly.
Memory operation occurs according to this signal, usually triggered by an edge (rising or
falling) of the clock.
The rising or falling edge is useful because it represents a time instant at which all the data
(or signals) involved are ready.
The clock rate (or frequency) has a bearing on the speed of the memory. A slow clock rate
means that memory operations occur less frequently, thereby slowing the data movement.
However, we cannot simply increase the clock rate without making other considerations.
Memory operations take time to complete. So the clock rate must allow the completion of
one operation before triggering the next one.
Semiconductor Memory: Static RAM and Dynamic RAM
Current main memory systems are based on semiconductor memory. A standard circuit
called a flip-flop can store 1 bit of data. A memory system can be designed by integrating
millions of these circuits together.
• Random access memory (RAM) refers to such a memory system, in which the stored
data can be accessed in any order.
• RAM based on flip-flops is called static RAM (SRAM). SRAM is fast and retains data
as long as it is powered, but it is volatile: the data is lost when power is removed.
• Each flip-flop is made up of 6 to 8 transistors, which can take up considerable space
when a larger memory size is to be packaged.
Packaged RAM chips are available in various standard shapes. The manufacturing process of
RAM is similar to that of semiconductor microprocessors.
An alternative technology is called Dynamic RAM (DRAM).
• Dynamic RAM (DRAM) stores data as charge on capacitors, arranged in an array or
table of cells. The array of cells provides the storage of multiple bits of data.
• The capacitors used in DRAM tend to lose their charge quickly, and therefore require a
periodic refresh cycle (every few milliseconds) or data will be lost. A memory
subsystem is required to support this refreshing.
• Compared to SRAM, DRAM is less expensive and smaller in size, although it requires
more power for refreshing.
Improving the Throughput
DRAM is significantly slower than SRAM. A processor may have to wait for 4 to 6 cycles
before DRAM can make the data available.
There are variants of DRAM that are designed to provide better data throughput through some
clever designs.
The above example shows that one memory operation requires two cycles (in a real system it
may require more). The first cycle is REQ, and the second cycle is READ.
Video RAM (VRAM)
VRAM is designed for video adapters. VRAM systems are dual-ported, allowing
simultaneous read and write operations.
• RAM is normally a single-port device. The CPU can perform reading or writing, but not
both at the same time.
• With VRAM, the PC can write into the memory to change what will be displayed, while
the video adapter continuously reads the memory to refresh the monitor's display. The
performance is greatly increased.
Early PCs reserved only 64K of address space for video RAM, even though a video card
might carry 256K of video RAM. To fit this RAM into the 64K space, the RAM is paged:
programs can only access a small portion (or page) of the video RAM area at a time. Some
newer cards map their entire memory directly into the PC's RAM space in high memory
(above 1024K), creating a video aperture. Only Windows-based operating systems, not
DOS, can support such cards.
This technology has since been superseded by DRAM technology.
Double Data Rate DRAM (DDR-RAM)
DDR RAM is the current mainstream memory technology.
• DDR DRAM transfers data at both the beginning and the end of a memory clock cycle,
therefore serving double the amount of data.
• A bus frequency of 100 MHz allows a single-channel DDR RAM module to serve
1.6 GB/s.
• PC-1600 64-bit DDR RAM using DDR-200 chips on a 100 MHz bus has a single-channel
output of 1.6 GB/s.
o Transfer rate = 100 MHz (memory clock rate) × 2 (for double rate) × 64 (number of
bits transferred) / 8 (number of bits per byte).
DDR RAM makes use of both rising and falling edge - double pumped.
• DDR-200 (PC-1600): 100MHz = 1.600 GB/s
• DDR-266 (PC-2100): 133MHz = 2.133 GB/s
• DDR-400 (PC-3200): 200MHz = 3.200 GB/s
The DDR2 DRAM series allows the IO bus clock to run at twice the internal memory clock,
fetching double the data per internal clock cycle on average.
• DDR2-400 (PC2-3200): 100 MHz (Memory) = 200 MHz (IO) = 3.200 GB/s
• DDR2-800 (PC2-6400): 200 MHz (Memory) = 400 MHz (IO) = 6.400 GB/s
• The higher latency makes DDR2-400 perform worse than the original DDR-400.
The DDR3 DRAM series allows the IO bus clock to run at four times the internal memory
clock, fetching quadruple the data per internal clock cycle.
• DDR3-800 (PC3-6400): 100 MHz (Memory) = 400 MHz (IO) = 6.400 GB/s
• Even higher latency results in longer access times.
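All of the transfer rates listed above follow the formula given for PC-1600; a Python sketch (function name invented; a 64-bit module width is assumed throughout):

```python
def ddr_rate_gb_s(memory_clock_mhz, transfers_per_memory_clock, bus_bits=64):
    """Transfer rate in GB/s = clock x transfers per clock x bytes per transfer."""
    mb_s = memory_clock_mhz * transfers_per_memory_clock * bus_bits / 8
    return mb_s / 1000

print(ddr_rate_gb_s(100, 2))  # 1.6  (DDR-200: double-pumped)
print(ddr_rate_gb_s(100, 4))  # 3.2  (DDR2-400: IO clock doubled, still double-pumped)
print(ddr_rate_gb_s(100, 8))  # 6.4  (DDR3-800: IO clock quadrupled)
```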
5. Input and Output Devices
IO design is an often-overlooked issue. Many people are more concerned about the CPU
speed. However, the performance of a computer system often rests on IO performance, as
IO devices are significantly slower than the CPU and memory.
There are many attributes separating one IO device from another. The following lists the
major characteristics:
• Transfer data unit: character-stream or block transfer.
• Relation between IO operations and programs: synchronous or asynchronous data
transfer.
• Data access order: sequential or random access.
• IO device exclusiveness: Sharable or dedicated device.
• Data mutability: read and write allowances.
• IO device latency: fast or slow setup time.
• IO operation speed: high or low data transfer rate.
For example, a keyboard is a character-stream, asynchronous, sequential, dedicated, read-
only, fast setup, and low data transfer rate IO device.
On the other hand, an electro-mechanical hard disk is a block transfer, asynchronous, random
access, sharable, read and write, slow setup time, and high data transfer rate IO device.
IO Operation and Latency
Latency is a major issue in IO operations. The following diagram explains the detailed
stages of IO operations.
Synchronous and Asynchronous IO
An IO request may either make the program wait until the operation completes, or allow the
program to continue and be notified later. The first method is known as synchronous IO and
the second method as asynchronous IO.
We need different architectures and services to handle these types of IO.
• Advantage. Asynchronous IO allows the user program to do something else while the IO
device is handling the request.
• Disadvantage. Asynchronous IO makes it more difficult for the programmer to write
programs that manage exceptional situations, such as an error occurring while the
program is doing something else.
IO Design for Computer Systems
IO devices are running at a significantly slower speed compared to the CPU and the Memory
System.
The basic design strategy is to separate them into different worlds of speed, in the same way
as the Memory System is separated from the CPU.
The following figure shows how IO devices are connected to the IO controller hub. The hub
is then connected to the memory controller hub, before reaching the CPU. The bus speed
decreases as the bus moves further away from the CPU.
IO devices are connected to the bus leading out from the IO controller hub.
• Each device controller handles a specific type of device. Sometimes one controller can
manage more than one device (such as SCSI).
• Each device controller has a local buffer, which is used to hold data while it is being
transferred between the computer system and a device.
• Device controllers operate independently from the CPU.
Sending instructions to IO devices
The IO device is now separated from the CPU by at least two controllers in the current
programmable computer design. The CPU cannot directly send data or signals to IO devices.
While the CPU can directly send data to the Memory System through the MAR/MDR
registers, there is no such mechanism built in for IO device.
Port-mapped IO
Port-mapped IO uses dedicated instructions for IO operations. An example is the LMC
instructions IN and OUT. These instructions are handled directly by the CPU and the CPU
sends signals directly to the IO devices to carry out the operation. The IO devices have their
own memory space for data movement between the CPU and the IO device.
Memory-mapped IO
Another solution to this problem is memory-mapped IO. Memory-mapped IO unifies access
to the Memory System and access to IO devices. Sending signals or data to an IO device is
done by writing data to the Memory System: certain areas of the Memory System are
declared special, and any data written to one of these areas is read and handled by an IO
device. Conversely, an IO device sends data back by writing to these areas.
The following figure illustrates how memory-mapped IO operates.
The CPU transfers data to an IO device by writing the data to the mapped data registers and
setting the control register appropriately. The device controller monitors the control register,
takes the data, and then clears the control register for the next data transfer.
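The handshake just described can be sketched in Python (everything here is illustrative: the mapped addresses, function names, and the meaning of the control value are invented for the sketch):

```python
# Illustrative sketch of memory-mapped IO: two "memory" addresses double as
# the device's data and control registers.
MEMORY = [0] * 256
DATA, CTRL = 240, 241         # invented mapped addresses

def cpu_write_byte(value):
    """CPU side: place the data, then set the control register to 'ready'."""
    MEMORY[DATA] = value
    MEMORY[CTRL] = 1          # 1 = data ready for the device

def device_poll():
    """Device controller side: consume data if ready, then clear control."""
    if MEMORY[CTRL] == 1:
        value = MEMORY[DATA]
        MEMORY[CTRL] = 0      # clear for the next transfer
        return value
    return None

cpu_write_byte(65)
print(device_poll())          # 65
print(device_poll())          # None (nothing pending)
```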
Signalling from IO Devices
The CPU has one of the two following options after sending an instruction to an IO device:
• Synchronous IO mode. The CPU will wait and keep polling if the IO operation is still
going on.
• Asynchronous IO mode. The CPU will forget about the IO operation for the time being
and does something else.
In asynchronous IO mode, the CPU waits for a signal from the IO device indicating that the
IO operation is complete or that an error has occurred. This signal is known as an interrupt.
IO Interrupt is handled in the following steps.
• After the CPU receives an interrupt signal, the execution of the current program is
suspended.
• The program counter (PC) is saved so that the execution will return to the suspended
place later.
• The controller then refers to the type of the interrupt and the CPU is made to execute a
segment of code according to an interrupt vector.
• Interrupt vector is a table of pointers usually stored in the lower part of the memory.
• The pointers are the starting addresses of interrupt service routines (ISR) that are
designed to handle a particular type of interrupt.
After the interrupt is handled, the CPU returns to executing the program at the address where
the interrupt occurred.
The following figure illustrates the steps involved in handling an IO interrupt.
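The dispatch steps can also be sketched in Python (illustrative: a real interrupt vector stores ISR start addresses in low memory, not function objects, and the state names are invented):

```python
# Sketch of interrupt dispatch through an interrupt vector.
def timer_isr(state):
    state["log"].append("timer handled")

def disk_isr(state):
    state["log"].append("disk handled")

INTERRUPT_VECTOR = {0: timer_isr, 1: disk_isr}  # interrupt type -> ISR

def handle_interrupt(state, interrupt_type):
    saved_pc = state["pc"]                    # save PC of the suspended program
    INTERRUPT_VECTOR[interrupt_type](state)   # jump to the ISR for this type
    state["pc"] = saved_pc                    # resume where the program left off

state = {"pc": 42, "log": []}
handle_interrupt(state, 1)
print(state["log"], state["pc"])              # ['disk handled'] 42
```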
6. Input and Output Device Case Study: Hard Disk
The hard disk is arguably the most important IO device of modern computer systems. Hard
disks are currently based on magnetic disk technology, which has been serving us since the
1960s despite facing many challenges from other technologies.
Hard disks contribute to modern computer systems in two major ways.
• Provide non-volatile long-term storage for data and files.
• Provide secondary memory to supplement the main memory during the operation of
computers. Data from the main memory can be moved temporarily to hard disk to spare
some space for other data.
Magnetic disks are physically composed of platters of solid disks stacked up on a rotational
spindle.
Platters are usually made of metal or glass, with magnetic material deposited on both sides,
so one platter has two surfaces for data storage. Each surface is divided into concentric
tracks; typically there are tens of thousands of tracks on a platter. The set of tracks at the
same position across all platters is called a cylinder.
Each track is further divided into sectors. A sector is the smallest unit involved in read/write
operations. Because the outer tracks are longer, more sectors are usually placed there. This
scheme is called constant bit density.
To perform a read/write operation, a moving arm with a read/write head is moved over the
track of the desired sector. This is called a seek operation and the time required to move the
arm is called the seek time.
After the read/write head has moved to the desired track, it may not be over the desired
sector. The head must wait for the rotation of the platter to bring the sector under it; this
waiting time is called the rotational delay or rotational latency. If the platter is not already
spinning, further delay must be taken into consideration.
Read/Write Performance
The time taken to read/write data on magnetic hard disks must take the following overhead
into consideration.
• Seek time. The time taken for the read/write arm to move over the desired track.
• Rotational Delay. The time taken for the desired sector to be rotated under the read/write
head.
• Controller time. The time taken for the IO controller to process an IO request.
• Queuing time or queuing delay. A hard disk can serve only one request at a time. Other
requests must wait in a queue for service by the hard disk.
A typical performance specification of a modern magnetic hard disk is shown in the
following.
Fujitsu Hard Disk MHT2160BT
Model: MHV2160BT
Storage capacity (formatted): 160.0 GB
Bytes/sector: 512
Seek time: track to track 1.5 ms typ.
  Average: 12 ms typ. (read), 14 ms typ. (write)
  Maximum: 22 ms typ.
Rotational speed: 4,200 RPM
Data transfer to/from host: 150 MB/s
Interface: SATA
Buffer size: 8 MB
Worked example. The time taken to transfer 512 bytes can be worked out as the following.
Average disk access is the sum of the following: average seek time, average rotational
latency, transfer time, and controller time.
Average seek time = 12 ms
Average rotational latency = 50% × (1 / 4200 RPM) × 60 = 7.1 ms
Transfer time = 512 bytes / 150 MB/s = 0.003 ms
Controller overhead = 0.1 ms
Overall average disk access = 12ms + 7.1ms + 0.003ms + 0.1ms = 19.2ms
Note that the data transfer time (0.003 ms) contributes only a small percentage of the average
disk access time. The seek time and rotational latency factors are predominant.
To speed up disk access, we should therefore first focus on the seek time and rotational
latency, which together contribute over 99 percent of the overall disk access time.
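The worked figures above can be reproduced with a short Python function (the name and parameters are invented for illustration):

```python
def avg_disk_access_ms(seek_ms, rpm, transfer_bytes, transfer_mb_s, controller_ms):
    """Average access time = seek + half a revolution + transfer + controller."""
    rotational_ms = 0.5 * (60.0 / rpm) * 1000            # half a revolution, in ms
    transfer_ms = transfer_bytes / (transfer_mb_s * 1e6) * 1000
    return seek_ms + rotational_ms + transfer_ms + controller_ms

t = avg_disk_access_ms(seek_ms=12, rpm=4200, transfer_bytes=512,
                       transfer_mb_s=150, controller_ms=0.1)
print(round(t, 1))   # 19.2 ms, matching the figures above
```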
To minimize seek time, one can read more data than requested, hoping that subsequent
requests happen to use the data read ahead. The success of read-ahead relies on the
observation that requests have a property of spatial locality: when data is stored on hard
disks, it is arranged in sequence on neighbouring sectors and tracks.
7. Integration: PC Motherboard Design
The motherboard is a printed circuit board where the major computer components are
integrated together. In addition to the circuitry connecting the components, it also provides
power, cables, connectors, and physical housing.
A motherboard is designed around the features of the main processor. Apart from the main
processor, it contains a chip set for providing other functions such as memory and IO control.
There are two chips that work with the CPU: one is called memory controller hub (north-
bridge) and the other IO controller hub (south-bridge).
• The north-bridge connects the CPU and the following components: main memory, AGP
bus (video), and the south-bridge. The north-bridge determines the performance of the
data transfer between the memory and the CPU (often the deciding factor in system
performance), and the type and amount of memory that can be used.
• The south-bridge is detached from the CPU, and it is responsible for handling slower
communications. The separation of the two allows the critical high speed transfer
between the CPU and memory to happen without the intervention of the slower
communication between peripheral devices, which is the main role of the south-bridge.
• The south-bridge includes an interrupt controller that allows peripheral devices to alert
the CPU.
• The south-bridge also includes a DMA controller that allows data transfer between IDE
hard disks and the main memory.
• The CPU and the north-bridge are connected with a high-speed bus known as the front-
side bus (FSB). The speed of the CPU is determined by the speed of the front-side bus
times a multiplier. Example Intel technologies for the FSB are GTL+ and AGTL+.
The CPU connects to the L2 Cache Memory through the back-side bus.
The FSB is often regarded as the performance bottleneck of this classic PC design. The all-
important memory-to-CPU data transfer running on the FSB is shared with data write-back
and the data of IO operations.
Motherboard for Core Duo Processor
The following diagram is adapted from Intel information, showing the specific components
and the bus data rates of a motherboard design for Intel Core Duo processor.
This chapter will discuss a number of case studies, each of which examines a particular
feature used to improve computer system performance.
• Increasing Clock Rate
• Adding a CPU Cache
• Adding General Purpose Registers
• Adding an additional System Bus
• Direct Memory Access (DMA)
The features for improving computer performance can be summarized into three approaches.
• Performing a task faster.
• Performing tasks in parallel.
• Avoiding certain time-consuming tasks.
1. Increasing Clock Rate
Increasing the clock rate is a simple method to improve execution performance. An
increased clock rate allows more actions to take place in the same period of time. For the
same number of RTL steps, doubling the clock rate theoretically reduces the time taken by
half.
There are several limitations concerning increasing clock rate to improve performance.
• All components have operating ranges, and the clock rate is one of them. For example,
the Memory System might need at least 5 ns to complete a read/write operation, so the
clock cycle cannot be shorter than 5 ns (that is, the clock rate cannot exceed 1/(5 ns) =
200 MHz).
• Components running at high speed generate heat. Too high a clock rate may generate
heat beyond the capability of the designated cooling device.
• Components running beyond their designated speed have a shortened lifespan.
(Diagram: processor frequency 1.33 GHz = system clock 133 MHz × CPU multiplier 10.)
Over-clocking is carried out by enthusiasts to boost their computer performance. The above
diagram shows that there is a system clock driving the clock signals for other components.
Modern processors have a frequency multiplier that multiplies the system clock for the
internal processor clock. The ratio between the system clock and the internal processor clock
is the CPU multiplier.
• Increasing the system clock speeds up both the processor and the memory system.
• Increasing the CPU multiplier over-clocks the processor only.
• To improve component stability under over-clocking, increasing the operating voltage of
the component may help.
• The increased clock rate causes more heat generation. Additional cooling should be
installed to help dissipate the heat.
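The clock relationship discussed above is a single multiplication; as a one-line Python sketch (function name invented):

```python
def processor_frequency_ghz(system_clock_mhz, multiplier):
    """Internal processor clock = system clock x CPU multiplier."""
    return system_clock_mhz * multiplier / 1000

print(processor_frequency_ghz(133, 10))   # 1.33 GHz, as in the example above
```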
Over-clocking can potentially damage computer components. It may also cause errors in
computation due to lower stability of the components. The processor performance gained is
actually not that significant.
This section is a theoretical discussion of over-clocking and it does not teach you how to
over-clock.
2. Adding CPU Cache
The clock rates of the CPU and the main memory are significantly different. More
importantly, the potential throughputs are also significantly different. Increasingly, the main
memory has become a major performance bottleneck: the throughput of the main memory
cannot satisfy the fast processor's demand for instructions and data.
CPU cache is a feature that reduces the processor’s dependency on the throughput of the main
memory.
The concept of the Memory Wall was postulated by Bill Wulf and Sally McKee in 1994.
They predicted that the divergence between CPU speed growth and memory speed growth
would soon cause all computer performance to be dominated by memory speed.
Memory Operations
The very first design of programmable computer assumes that all the components, including
the CPU and the Memory, are operating at the same clock rate.
Look at the following fetch part of a LMC instruction in RTL. The CPU expects the Memory
to have the data ready at the MDR in one clock cycle.
ADD 20
PC > MAR
M[MAR] > MDR ; expects the memory to have the data ready in one clock cycle
MDR > IR
Consider that CPU technology has advanced so that the CPU can operate at a faster clock
rate, while the Memory System's speed has remained the same. With our current design, the
CPU has to operate at the same speed as the Memory, and is therefore unable to exploit the
improvement in CPU performance.
A solution to this problem is to separate the two buses and to allow them running at different
speeds. A bus controller is placed between the two buses to perform coordination of data
movement between the two.
The CPU can now run at a faster clock rate, but there is another problem. The program
instructions are still located in the Main Memory. During the execution of each instruction,
the CPU still has to wait for the slower Memory System to respond. A solution is to prevent,
as far as possible, the CPU from accessing the Memory System.
CPU Cache
The CPU memory cache is a very fast memory subsystem located within the CPU. The act
of the CPU cache making a copy of data in the main memory is called caching. Caching is a
technique that allows the CPU to operate more independently of the Memory System. If the
required data or instruction is already within the CPU, the CPU need not read from the
Memory System.
The following figure describes a scenario of how caching would allow the CPU to operate
independently.
Consider that the computer is executing a small LMC program that involves a loop.
• As the execution begins, the CPU reads from the Memory System each instruction one by
one.
• The CPU needs to wait every time for the slower Memory System to catch up.
• The key operation is that the CPU is saving each instruction in the CPU memory cache.
When the loop is executed the second time, the instructions to be executed are already
available in the CPU memory cache. The CPU can read directly from the CPU cache and
can operate with fewer accesses to the Memory System.
The ideal CPU cache is one that always stores the required data (or instruction). Memory
operation is not needed at all.
The cache hit rate is the percentage of requests that can be satisfied by the cache. The ideal
CPU cache would have a hit rate of 100%.
Theoretically a high hit rate can be achieved by:
• Larger cache size.
• More accurate predictions about the future data requests. Keep such data in the cache.
A drawback of a larger cache size is higher latency due to the search operation. A search of
the cache is involved in every memory operation. Although the cached data is organized
efficiently in a data structure, the search time still increases with larger cache sizes.
Multi-level caches are designed to provide a better balance between cache size and latency.
The lowest-level cache (the L1 cache) is smaller in size, so its latency is low. The second-
level cache is larger, and each subsequent level has increasing size. The search begins with
the lowest-level cache first.
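The scenario above (a loop whose second pass is served entirely from the cache) can be sketched with an illustrative single-level cache; the class, its arbitrary eviction policy, and the hit-rate bookkeeping are all invented for the sketch:

```python
class SimpleCache:
    """Toy cache in front of a slow backing store, tracking the hit rate."""
    def __init__(self, memory, capacity):
        self.memory = memory          # the slow main memory (a list)
        self.capacity = capacity
        self.store = {}               # address -> cached value
        self.hits = 0
        self.requests = 0

    def read(self, address):
        self.requests += 1
        if address in self.store:     # hit: no memory operation needed
            self.hits += 1
            return self.store[address]
        value = self.memory[address]  # miss: go to the main memory
        if len(self.store) >= self.capacity:
            self.store.pop(next(iter(self.store)))  # evict the oldest entry
        self.store[address] = value
        return value

    def hit_rate(self):
        return self.hits / self.requests

main_memory = list(range(100))
cache = SimpleCache(main_memory, capacity=8)
for _ in range(2):                    # a small loop executed twice
    for addr in (0, 1, 2):
        cache.read(addr)
print(cache.hit_rate())               # 0.5: the entire second pass hits the cache
```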
Locality of Reference
The CPU cache manager can often make correct predictions about future requests, even
though there is no crystal ball.
The locality of reference is a phenomenon that the next memory request is more likely to be
at a nearby address than at any other addresses. For example, if the current memory request is
at address 1000, then the chance of requesting address 1001 is higher than any other
addresses.
The locality of reference exists because of some properties of program execution, including:
• Sequential execution model: usually the next instruction to execute is at the next address.
• Local branching: even if a branching occurs, the address to branch would be nearby if
branching is associated with a condition (if-else) or repetition (for-while) structure.
• Array processing: data in an array is usually accessed from one end to the other.
Coherency in Caching
With caching, a piece of data can exist as several copies in different storage facilities.
Modifying the value in one place renders the other copies incorrect. This is actually
acceptable provided that only one process accesses the data.
In a multi-programming environment, this is a potential problem because another program
could access the other copies of the same data. This happens when the CPU switches from
one process to another. The situation is more complicated in multi-processor environments
and in distributed environments. There are many methods to guarantee consistency, but
each of them requires effort to manage the data copies.
3. Adding General Purpose Registers
Registers are at the top of the memory hierarchy and they are very fast memory.
Types of Registers
General purpose registers are also called user-visible registers. They can be directly
manipulated with instructions, such as storing/loading data for such registers.
The LMC has only one general purpose register, the accumulator (ACC). It can be
manipulated with LDA, STO, ADD, SUB, IN and OUT instructions.
Specific purpose registers are used to support the operation of the CPU in the fetch and
execution cycle. Examples of specific purpose registers include the following:
• Program Counter Register (PC) for storing the program counter.
• Instruction Register (IR) for temporarily storing the instruction loaded from the memory.
• Memory Address Register (MAR) for holding the address of a memory location where
data may be loaded or stored.
• Memory Data Register (MDR) for holding the data involved in a load/save operation with
a memory location.
• Status Register for holding the various statuses during the operation of the CPU, such as
arithmetic errors (e.g. overflow or carry), low power, etc. A status is indicated with a
flag, often 1 bit wide. For example, the 8086 and 8088 chips have a 16-bit status register
storing the following flags: carry, parity, auxiliary carry, zero, sign, trap, interrupt
enable, direction, overflow, IO protection, nested task, resume, and virtual 8086 mode.
• IO Registers for holding the data and identity of the IO device. This is not often used in
modern architectures.
• There are other special purpose registers, such as Constant Registers for holding special
values such as zero and one.
General purpose registers located in the processor are useful for storing intermediate and
temporary data. Most processes and operations involve a lot of steps. Each step would
consume data from the previous steps and generate data for the next steps. A programmer can
write instructions to store these data in general purpose registers.
LMC has only one general purpose register. Many memory operations are found in LMC
programs because intermediate data can only be stored in the main memory; there is no
spare register in the processor for storing such data.
Memory operations are slow and the performance would be significantly improved if memory
operations related to intermediate data could be avoided.
The following LMC program shows such an example.
The data movement between CPU and Memory System can be reduced with more general-
purpose registers in the CPU.
114
Adding General Purpose Registers
Three new general-purpose registers, named R1 to R3, are added to the LMC to improve
efficiency. The old accumulator ACC is renamed to R0. The three new registers are
connected to the CPU system bus. The following figure shows the revised design.
The new general-purpose registers are not usable without new instructions for
manipulating them. The following describes two new LMC instructions:
• MOV RA, RB: copy data from Register B (RB) to Register A (RA).
• SUB RA: subtract RA from R0 and store the result to R0.
The following shows the RTL steps for the two new instructions.
MOV RA, RB SUB RA
Each of the two new instructions requires 5 RTL steps. If each step takes one clock cycle, each
instruction takes 5 clock cycles.
• They take two fewer RTL steps than the original LMC SUB instruction, which
takes 7 RTL steps.
• Memory operations may take more than one clock cycle, so comparatively the two new
instructions are even faster because they carry out fewer memory operations.
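The step counts above can be tallied with a short sketch. The per-phase breakdown is an assumption based on the RTL shown: a 4-step fetch phase (PC>MAR, M[MAR]>MDR, MDR>IR, PC+1>PC), a 3-step execution phase for the original memory-based SUB, and a 1-step execution phase for the register-based instructions.

```python
# Hypothetical tally of RTL steps per instruction, assuming one clock
# cycle per RTL step and a common 4-step fetch phase.

FETCH_STEPS = 4

exec_steps = {
    "SUB addr (original LMC)": 3,  # addr>MAR, M[MAR]>MDR, ACC-MDR>ACC
    "MOV RA, RB (new)": 1,         # RB>RA
    "SUB RA (new)": 1,             # R0-RA>R0
}

for name, steps in exec_steps.items():
    print(f"{name}: {FETCH_STEPS + steps} RTL steps")

savings = exec_steps["SUB addr (original LMC)"] - exec_steps["SUB RA (new)"]
print(savings)  # 2 fewer steps, plus no memory operation when executing
```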
With the new instructions MOV and SUB, the LMC program is rewritten as the following to
exploit the new general purpose registers.
115
Example: Revised LMC program
The instructions STO, SUB, and LDA involve loading or storing data from/to the Memory
System. This is because the CPU has only the ACC and no other place to store intermediate
data.
00 IN ; #1 store in R0
01 MOV R1, R0 ; R1 = R0
02 IN ; #2 store in R0
03 MOV R2, R0 ; R2 = R0
04 SUB R1 ; R0 = R0 – R1
05 BRP 08 ; if R0 >= 0
06 MOV R0, R1 ; R0 = R1 R1 stores #1
07 BR 09
08 MOV R0, R2 ; R0 = R2 R2 stores #2
09 OUT
10 COB
The revised program should perform better. The program is shorter, and some instructions
also take a shorter time to execute.
Question: Compare the execution of the two programs and evaluate the performance gain.
Answer:
A number of quantitative measurements can be used to compare the two programs. Two of
them will be used here: RTL steps (similar to clock cycles) and memory operations.
Normally, all instructions that would have been executed in the programs are taken into
consideration. The programs have no loop, which makes the analysis easier.
The program has a conditional branch. For simplicity, only the case where the first integer
is greater than the second is considered.
Number of instructions: 11 (original program) vs 11 (revised program)
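As a rough illustration of how such a comparison proceeds, the following sketch tallies the revised program on the path where the first integer is greater (the branch at address 05 is taken to 08, so 06 and 07 are skipped). The cost model is an assumption, not part of the course material: every instruction needs a 4-step fetch with one memory read, and each instruction on this path needs one execution step with no further memory operation.

```python
# Instructions actually executed on the "first integer greater" path:
# 00 IN, 01 MOV, 02 IN, 03 MOV, 04 SUB, 05 BRP (taken), 08 MOV,
# 09 OUT, 10 COB.
executed = ["IN", "MOV", "IN", "MOV", "SUB", "BRP", "MOV", "OUT", "COB"]

FETCH_STEPS, EXEC_STEPS, FETCH_MEM_OPS = 4, 1, 1   # assumed costs

rtl_steps = sum(FETCH_STEPS + EXEC_STEPS for _ in executed)
mem_ops = sum(FETCH_MEM_OPS for _ in executed)

print(len(executed), "instructions executed")   # 9
print(rtl_steps, "RTL steps")                   # 45
print(mem_ops, "memory operations")             # 9
```

The original, ACC-only program would add a data memory operation for every STO, SUB, and LDA on the equivalent path, which is where the performance gain comes from.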
116
4. Parallel Execution and Adding another System Bus
Performing tasks in parallel can certainly shorten the time to complete. However, parallel
processing must satisfy the following requirements:
• The tasks are independent. For example, if task B depends on the result of task A, then A
and B cannot be performed together.
• Additional resources are available for the parallel execution of the tasks.
This case study investigates the effect of parallel execution of RTL operations.
Consider that we have an instruction ADD R1, R0, R2 that adds R1 to R0 and the result is
stored in R2. This is a register-based instruction. In the execution phase, all data movement
happens on the system bus inside the CPU. The system bus design is shown in the following.
The RTL for the instruction is given below. There are 6 steps and it takes 6 clock cycles to
complete the execution (assume 1 clock cycle per RTL step).
PC > MAR
M[MAR] > MDR
MDR > IR
PC + 1 > PC
R[0] + R[1] > R[0]
R[0] > R[2]
If the 6 RTL steps could be executed in parallel, then it would just take 1 clock cycle to
complete the execution. There are a few reasons why this is not possible.
• An RTL operation may depend on the result of a previous RTL operation. For example,
the second RTL operation requires the MAR loaded by the first RTL operation.
These two operations cannot happen together.
• Hardware design restricts the possible parallel operations. The last two RTL operations
both require the system bus, so they cannot happen together.
117
The multi-point bus can only support one pair of components communicating at a time. Let
us examine which of the steps use the system bus. Steps #1, #3, #5, and #6 happen on the
system bus.
Steps #2 and #4 do not use the system bus, so the problem is to consider whether it is
possible for them to happen in parallel with any other steps.
• Step #2: Not possible because step #2 cannot occur before completion of step #1. Also,
step #3 cannot occur before completion of step #2. There is data dependency between the
first 3 steps.
• Step #4: The increment of PC is caused by a signal from the Control Unit. This step can
happen in parallel with step #5.
The following shows the timing information of the execution of the instruction. The
instruction now takes one fewer clock cycle to complete.
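The scheduling argument can be checked with a small greedy scheduler. The dependency sets and the bus usage below are my reading of the text, not an official specification. Note that the greedy schedule happens to move step #4 even earlier (alongside step #2); either way the total is 5 cycles, one fewer than 6.

```python
# Greedy scheduling of the six RTL steps under the two constraints from
# the text: data dependencies, and a single system bus that only one
# step may use per clock cycle.

deps = {1: set(), 2: {1}, 3: {2}, 4: {1}, 5: {3}, 6: {5}}
uses_bus = {1, 3, 5, 6}

done, schedule = set(), []
while len(done) < len(deps):
    bus_free, this_cycle = True, []
    for step in sorted(deps):
        if step in done or not deps[step] <= done:
            continue                 # already done, or inputs not ready
        if step in uses_bus:
            if not bus_free:
                continue             # the single bus is taken this cycle
            bus_free = False
        this_cycle.append(step)
    schedule.append(this_cycle)
    done.update(this_cycle)

print(schedule)       # [[1], [2, 4], [3], [5], [6]]
print(len(schedule))  # 5 cycles instead of 6
```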
Adding one more system bus can increase the opportunity for parallel execution. The
following figure shows a possible design based on two system buses.
The output port of the ALU should be connected to both system buses to facilitate the
movement of data. However, the two system buses cannot be connected because they should
be carrying different data. A pair of control gates is placed at the output port of the ALU to
control which system bus is connected to the output port.
118
With the additional system bus, the last two steps in the instruction ADD R1, R0, R2 (in
below diagram) can happen in parallel.
Adding an additional system bus provides opportunities for parallel execution of the steps in
the fetch and execution cycle. There is a cost implication of adding a bus, however, and the
designer must weigh the costs and benefits.
119
5. Direct Memory Access (DMA)
Direct memory access is a technique that allows IO-to-Memory operations to occur in parallel
with processor execution of instructions.
IO-to-Memory operations occur quite frequently in modern computers:
• Loading programs from hard-disk to the main memory before execution.
• Loading data for program processing.
In the current computer design, the CPU needs to take care of IO operations through handling
interrupts, even in asynchronous IO operation mode.
• CPU executes an instruction to initiate an IO operation.
• CPU continues to execute other instructions, leaving the IO operation to run in
parallel.
• IO operation completes and raises an interrupt.
• CPU suspends the current execution and handles the interrupt.
• After the interrupt is handled, the CPU resumes the suspended execution of instructions.
For a busy high-speed device handling many requests, there will be too many interrupts. Each
interrupt hampers the smooth operation of the CPU, and the CPU is forced to do a lot of IO
handling instead of executing programs.
One solution to free the CPU from IO activities is to allow the IO devices and the main
memory to communicate with each other independently.
• High-speed devices use a method called direct memory access (DMA), in which the device
controller transfers a whole block of data directly between the main memory and the
device's local buffer.
• Only one interrupt is generated per block.
• A DMA controller is instructed by the device driver (in the OS) with the address of a buffer
(in the main memory) and the length of data to copy. The CPU can do other things
independently.
• The major problem is that the Memory System can serve only one request at a time. DMA
still competes with the CPU for memory system access.
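The saving in interrupts can be sketched as follows; the block count and block size are invented for illustration.

```python
# Interrupts raised when transferring data with per-word interrupt-driven
# IO versus DMA, which raises only one interrupt per block.

def interrupts_per_word_io(num_blocks, words_per_block):
    # One interrupt for every word transferred.
    return num_blocks * words_per_block

def interrupts_dma(num_blocks, words_per_block):
    # The DMA controller moves each whole block itself and raises a
    # single completion interrupt per block.
    return num_blocks

blocks, words = 100, 512
print(interrupts_per_word_io(blocks, words))  # 51200
print(interrupts_dma(blocks, words))          # 100
```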
The following figure shows the operation of DMA.
120
Instruction set architecture concerns the programming aspect of computer architecture or
computer design. A computer designer can include many good features that allow a computer
system to run efficiently. However, the programmers must be able to make use of these
features through programming.
The instruction set, together with the skill of the programmers, determines the efficiency of
programs.
This chapter discusses instruction set architecture through Extended LMC (e-LMC). E-LMC
provides an instruction set for illustration of various concepts related to instruction set design.
1. Overview
121
• Format of instructions. Generally, instructions and data share the same representation in a
von Neumann computer. The LMC instructions are in the form of 3 decimal digits.
o A typical instruction has two items to fit into the representation: operation
code (op-code) and operands.
o For example, the first digit of LMC instructions is the op-code. However, if
the op-code is 9, then all three digits form the op-code. The format is said to
be variable.
The main purpose of instructions is no doubt processing data. An important consideration in
instruction set architecture is to determine how a processor stores data.
There are generally three types of architecture concerning data storage in a processor:
• Stack architecture. Data and instructions are stored in a stack in the processor.
• Accumulator architecture. Data is mainly stored in the accumulator (ACC).
• General-purpose register architecture. Data is stored in one or more general-purpose
registers in the processor.
122
Limitations of LMC Instruction Set
123
Example: Array Traversals
LMC is unsuitable for supporting some high-level language constructs, such as array traversal
and pointers. The following is a C program segment.
int array[10];
int i = 0;
If there were a C compiler for LMC, the compiler would be bounded by the limitations of the
LMC instruction set. A possible LMC code sequence would be generated as follows. The code
generation process would decide that the array occupies addresses 80 to 89 and the variable i
is at address 90.
124
Example: Pointers
int* ptr;
int i = 0;
ptr = &i;
*ptr = 10;
The following shows an equivalent LMC program. The variable ptr is allocated at address
90, and variable i at 91.
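The effect of the pointer segment can be traced on a flat memory model mirroring the allocation described, with ptr at address 90 and i at address 91:

```python
# Pointer semantics on a flat memory model: ptr holds the address of i,
# and *ptr = 10 is an indirect store through that address.

memory = {}
PTR_ADDR, I_ADDR = 90, 91      # allocation chosen in the text

memory[I_ADDR] = 0             # int i = 0;
memory[PTR_ADDR] = I_ADDR      # ptr = &i;   (store the address of i)
memory[memory[PTR_ADDR]] = 10  # *ptr = 10;  (indirect store through ptr)

print(memory[I_ADDR])  # 10: i was modified through the pointer
```

The double lookup memory[memory[PTR_ADDR]] is what an indirect addressing mode performs in hardware.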
In instruction set architecture, implicit operands and explicit operands differ in their
visibility in the instructions:
• Implicit operands are assumed in the specific instructions and they are not part of the
instruction format.
o The IN instruction of LMC has an implicit operand, the accumulator (ACC).
The LMC code for IN is 901, which does not contain an operand in the
instruction format.
125
• Explicit operands are visible in the instruction format.
o The STO instruction of LMC copies the ACC value to a memory address.
The STO code is 3XX, where XX is the memory address operand. It has one
explicit operand (the memory address) and one implicit operand (the ACC).
Instruction length should be as short as possible to minimize memory usage and memory
operations. Implicit operands do not occupy space in the instruction format, and instruction
set architects would make operands of some instructions implicit for better performance.
Question: How many implicit and explicit operands are there in the LMC ADD
instruction?
Answer:
The ADD XX instruction carries out an addition operation on the ACC and the value in a
memory address. The result is stored in ACC.
ACC = ACC + MEM[XX]
There is one explicit operand, which is the memory address.
There is one implicit operand, which is ACC.
Operands in LMC instructions refer to the desired location where the value can be
loaded or stored. For example, LMC instruction 5 08 is LDA 08, in which the explicit
operand 08 refers to the memory address storing the value to be copied to ACC. The implicit
operand ACC is the location to receive the value.
However, theoretically an operand value can be interpreted in different ways. Given the
operand value 08 above, these are some interpretations:
• 08 is the memory address of the referred location: LDA instruction copies the value in
memory address 08 to ACC.
• 08 is the value: LDA instruction copies 8 to ACC.
• 08 is the ID of a general-register: LDA instruction copies the value in register R8 to ACC.
• 08 is the memory address holding the memory address of the referred location: LDA
instruction copies the value in the memory address stored at address 08 to ACC.
The various interpretations are known as the different addressing modes for the operand.
126
4. Extended LMC
This section introduces the Extended LMC (E-LMC), with a new instruction set that has
incorporated some new features in the computer. The following summarizes the new features
in the E-LMC:
• Memory addressing space is extended to 1,000. The addresses range from 0 to 999.
• General purpose registers R4 to R7 are added. The accumulator (ACC) is preserved.
• Constant registers R0 to R3 are added.
• Two output devices are supported: (1) a seven-character LCD display based on ASCII
encoding and (2) a 3-digit LCD display based on signed decimal encoding.
• Two input devices are supported: (1) a buffered num-pad for entering a 3-digit decimal,
and (2) a buffered keyboard for entering a character.
• Memory-mapped IO is used instead of port-mapped IO. The IN and OUT instructions are
removed. Memory addresses 990 – 999 are reserved for input/output. Address 990 is
mapped to the 3-digit decimal output. Addresses 991 – 997 are mapped to the 7-character
ASCII-encoded output device. Address 998 is the buffer for the character-based
input, and address 999 is the buffer for the 3-digit decimal input.
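A sketch of how memory-mapped IO routes a store, based on the address map above; the device behaviour shown (a log of display updates) is invented for illustration.

```python
# A store to a reserved address is routed to a device instead of memory.

DECIMAL_OUT = 990            # 3-digit decimal display
CHAR_OUT = range(991, 998)   # 7-character ASCII display cells
output_log = []
memory = {}

def store(address, value):
    # The memory system decodes the address and routes reserved
    # addresses to the mapped output device.
    if address == DECIMAL_OUT:
        output_log.append(f"decimal display shows {value:03d}")
    elif address in CHAR_OUT:
        output_log.append(f"LCD cell {address - 991} shows {chr(value)}")
    else:
        memory[address] = value

store(990, 42)   # acts as output: no ordinary memory write happens
store(991, 72)   # ASCII 72 = 'H' on the first LCD cell
store(50, 7)     # ordinary memory store

print(output_log)
print(memory)    # {50: 7}
```

This is why the IN and OUT instructions can be removed: ordinary store and load instructions reach the devices through the reserved addresses.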
The above changes in the E-LMC require corresponding changes in the instruction set
architecture. New instructions should be added to take advantage of new features such as the
4 general purpose registers and the 4 constant registers.
• Register based instructions are added. They are for manipulation of the data stored in the
registers.
• Arithmetic and logic instructions are added. They are for improving the programmability.
Examples include multiplication and division.
• Memory-to-memory copy instruction is added.
• Instruction length is variable.
E-LMC is not backward compatible with LMC: LMC programs cannot run on E-LMC.
Maintaining backward compatibility is often difficult without paying a price in
performance and design extensibility. For example, E-LMC supports 1000 memory addresses
but the old LDA and STO instructions support an address range from 0 to 99 only.
127
Structural Diagram of E-LMC
E-LMC has four general-purpose registers, which should reduce the number of memory
operations.
• The general-purpose registers have IDs from 4 to 7.
• The register ID will be used as an operand for instructions involving the general-purpose
registers.
E-LMC also has four constant registers, which are read-only.
• The constant registers have IDs from 0 to 3.
• The register ID will be used as an operand for instructions in the same way as the general-
purpose registers.
The following table shows the constant values stored in the constant registers.
Instruction format is the way the op-code and the operands of an instruction are packed
together.
Logical soundness and stylistic consistency in instruction format design are important:
• A consistent format helps programmers to learn and helps prevent errors.
• A logically sound design facilitates processing of instructions in the processor and
improves performance.
E-LMC instruction format has the following features:
• Most instructions are two words long and a few are one word long.
• The op-code is always in the first word of two-word instructions. However, the op-code
may occupy the first digit, the first two digits, or all three digits.
• The first word of an instruction can be used to work out whether the instruction is two
words long.
• For some instructions such as the arithmetic instructions, there is more space in a two-
word format than required. The remaining space is padded (i.e. ignored).
o The last digit of the second word in the arithmetic instructions is padded.
The following shows the format of the E-LMC instructions graphically.
130
The following shows the different variants of LDA.
• The digit that acts as padding can be filled with anything. The processor would ignore it.
• The register addressing modes (direct and indirect) usually allow a short instruction
length. A general-purpose register ID is usually one digit long, so it occupies only one
digit of space in the instruction format.
The following shows the register-based instructions, including the move instruction and some
arithmetic instructions. Again register-based instructions are short.
The CPY instruction is the longest one in E-LMC. It copies a data block of a length
(L) from a source address (SAddr) to a destination address (DAddr). The order of data
copying is from the beginning to the end. It has three explicit operands:
• Length: an integer from 0 to 99.
• Source address: an address from 000 to 999.
• Destination address: an address from 000 to 999.
131
Exercise: Instruction Format
The following shows an alternative instruction design for the two operand instruction ADD
RN, RM. It performs addition on two general-purpose registers:
RN = RN + RM.
Comment on this design.
Answer:
• The instruction length is 2 instead of 1.
• Padding is applied to word 1 and word 2. The two operands need two digits.
• The opcode must be distinguished from the current instructions. The opcode 19 is
used.
• Using one digit for the opcode is acceptable, but the opcode cannot be ‘1’ because of the
need to differentiate it from the opcodes of existing instructions. So the opcode ‘2’ is used
because there is no other instruction with an opcode starting with ‘2’.
• It would however reduce the possible opcode available for adding new instructions.
132
The following lists the important points about the E-LMC program.
• The register-based instructions have no memory operation in the execution phase. The E-
LMC program uses a lot of these instructions and it should run significantly faster.
• E-LMC provides constant registers. The DAT definitions are not needed here.
• The instructions used in the E-LMC program are mostly one word long, but one instruction
is two words long. The last instruction HLT is at address 17 instead of 16, because
the STO instruction takes up 2 words.
The following shows the source code of the E-LMC program.
133
5. Addressing Modes
The addressing mode of an operand in instructions is expressed using the following syntax.
The addressing modes covered in this chapter are the most common ones. In the real
world there are processors designed with many addressing modes, though most of them
are combinations or variants of the common addressing modes.
Processor      Remarks
Intel 8086     17 addressing modes
Pentium        17 addressing modes (backward compatibility)
Itanium        1 addressing mode (register indirect addressing)
MIPS           Register addressing mode mainly
Java bytecode  Register indirect with offset in a stack architecture
134
Example: Operations of LDA of Different Addressing Modes
The following gives the content of a range of main memory addresses and some registers.
Address  Content      Address  Content      Register  Content
20       20           24       1            R4        0
21       9            25       21           R5        23
22       25           26       0            R6        20
23       26           27       25           R7        3
Work out the value loaded into ACC after the execution of the following instructions based
on LDA 22.
• Direct addressing LDA 22
• Immediate addressing LDA #22
• Indirect addressing LDA (22)
• Register addressing LDA R5
• Register indirect addressing LDA (R5)
• Register Index Relative Addressing LDA R5+2
Answer:
135
Register Index Relative Addressing LDA R5+2
ACC will contain 21. The operand 2 is the base address and R5 contains the offset. The
resolved address is 23 (which comes from R5) + 2 = 25. This address contains the desired
data, 21.
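All six variants can be checked with a small resolver. The memory and register contents are copied from the example table; the mode semantics follow the descriptions in this chapter.

```python
# Resolving LDA under each addressing mode against the example table.

memory = {20: 20, 21: 9, 22: 25, 23: 26, 24: 1, 25: 21, 26: 0, 27: 25}
reg = {"R4": 0, "R5": 23, "R6": 20, "R7": 3}

results = {
    "LDA 22 (direct)": memory[22],              # value at address 22
    "LDA #22 (immediate)": 22,                  # the operand is the value
    "LDA (22) (indirect)": memory[memory[22]],  # address of an address
    "LDA R5 (register)": reg["R5"],             # value in the register
    "LDA (R5) (register indirect)": memory[reg["R5"]],
    "LDA R5+2 (register index relative)": memory[reg["R5"] + 2],
}

for mode, acc in results.items():
    print(f"{mode}: ACC = {acc}")
```

The register index relative case resolves to address 23 + 2 = 25 and loads 21, matching the worked answer above.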
Address  Content      Address  Content      Register  Content
20       20           24       1            R4        0
21       9            25       21           R5        23
22       25           26       0            R6        20
23       26           27       25           R7        3
Work out the values in the address range given above after the execution of the following
instructions based on STO 32. Assume that ACC contains 18.
• Direct addressing STO 32
• Indirect addressing STO (32)
• Register indirect addressing STO (R6)
• Register Index Relative Addressing STO R5+32
Answer:
136
6. Instruction Execution in E-LMC
E-LMC is no different from LMC in that the execution of instructions follows the fetch and
execution cycle.
• Some instructions are two words long. If a memory system supports a data transfer size of
2 words, should the fetch phase always read in 2 words at a time?
• The operations of the execution phase can vary greatly. Most instructions do not require a
memory operation in the execution phase, so their execution phase is short. A few
instructions are based on indirect or index relative addressing modes; the execution phase
of these instructions is longer.
The following shows the operations of fetch and execution cycle of various LDA instructions
in E-LMC. It is assumed that the memory fetch is 1 word each time.
137
LDA RN+Addr (Register Index Relative Addressing)
PC > MAR
M[MAR] > MDR
MDR > IR (loaded the first word)
PC + 1 > PC
PC > MAR
M[MAR] > MDR
MDR > IR (2nd op is the base address)
IR + RN > MAR (add the value of RN, the offset, by the control unit)
M[MAR] > MDR
MDR > ACC
PC + 1 > PC
• The requirement to load instructions in two memory operations (the first word and the
second word) increases the number of RTL steps.
• Some RTL steps could happen in parallel, such as the increment of PC, so that the
number of steps is reduced.
• The Instruction Register (IR) can store multiple words.
The following table summarises the differences between variants of LDA in the number of
memory operations.
138
The E-LMC branch instructions are all in direct addressing mode.
The following table summarizes the number of memory operations of the register addressing
instructions and the branch instructions.
139
7. General Design Issues
Issues to consider in instruction set architecture design:
• The available functions.
• The addressing modes supported.
• The instruction format.
Programmers generally want more functions and therefore more instructions available, but a
larger number of instructions increases the complexity. Computer designers must strike a
balance between performance and programmability:
• Each instruction needs a unique op-code.
• The space designated for op-code determines the maximum number of instructions
possible.
• Allowing longer instructions increases the space for cramming op-codes and more operands
into an instruction format, but longer instructions need more memory operations to load.
There are two more common issues to consider in instruction set design.
• Number of explicit operands.
• Fixed instruction length design or variable instruction length design.
A low number of explicit operands can keep the instruction size small.
The nature of the instruction determines the total number of operands.
• Arithmetic operations including addition and subtraction have two operands.
• Negation and branch have one operand.
• Halt has no operand.
However, computer designers can make an operand implicit in the instruction and reduce the
size of the instruction. For example, E-LMC assumes that one operand in ADD is the ACC.
Instructions that have many operands inevitably add size to the instruction format. A
computer designer has to decide whether to include such instructions in the instruction set.
140
Fixed Instruction Length Design
In the fixed instruction length design approach, every instruction is of the same length. LMC
is fixed length, while E-LMC is variable length.
Fixed length allows more efficient instruction fetch.
• The fetch phase can read in 2 or 4 words at the same time.
• Some memory systems support fetching multiple addresses in one operation.
• MDR and IR sizes are larger to store more words in one instruction.
• The number of RTL steps is reduced.
The following shows an example of LDA under fixed instruction length design.
However, fixed instruction length design takes up more memory for storing instructions.
The length of all instructions is the same as the longest instruction in the set. For example,
the CPY instruction in E-LMC has a length of 3 words. All other instructions are padded so
that their length is also 3.
141
8. CISC and RISC Architectures
There are two fundamentally different philosophies in instruction set design for processors.
The Complex Instruction Set Computer (CISC) philosophy is that a processor should
provide a large and rich set of instructions for its programmers and make efficient use of
memory.
• A typical CISC CPU supports as many as two hundred instructions.
• The rich and flexible set of instructions eases the programming task and reduces the
number of instructions required to implement a program.
The philosophy of the Reduced Instruction Set Computer (RISC) is that the performance of a
CPU can be greatly enhanced by simplifying its instruction set.
• A RISC CPU has a small instruction set and executes its instructions extremely quickly
because the instructions are so simple.
• A typical RISC CPU, such as the SUN SPARC CPU, supports as few as 52 instructions.
In state-of-the-art processor design, the boundary between the CISC and RISC architectures
is becoming more blurred.
The following describes an example from each of the CISC and RISC architecture
approaches.
PowerPC CPU
• A RISC based processor.
• The instruction set has 224 instructions (divided into 6 categories: integer, floating point,
load/store, branch, processor, and memory control instructions).
• Fixed length instructions (all instructions are 32 bits long).
• Instructions may have zero to five operands.
• Most instructions use register addressing mode; only load/store and branch instructions
use memory addressing.
• There are around 70 registers for program use.
• A pipelined, superscalar architecture, with multiple different execution units, branch
prediction and out-of-order execution.
• A branch history table to improve branch prediction.
Pentium CPU
• A CISC based processor.
• The instruction set has 336 instructions (28 system, 92 floating point, 52 multimedia
extension, and 164 integer, logical and other general instructions).
• Variable length instructions.
• Instructions support zero to three operands.
• There are 12 different addressing modes.
• There are eight general-purpose registers and eight floating point registers for program use.
• There are two five-stage pipelines, but the processor does not use out-of-order processing
techniques.
143
This chapter discusses several architectural concepts that are pertinent to the design of high
performance computer systems.
• Super-scalar Processing and Pipeline Architecture
• Multi-core
• Mainframe Computing
These architectures are designed with performance scalability in mind. In other words, they
have the extensibility and flexibility to handle large-scale data processing tasks. Parallelism
is the basis of these architectures. The capability to carry out actions in parallel can achieve
a greater performance boost than carrying out actions faster.
Pipeline architecture is an example of instruction level parallelism. It allows multiple
instructions to be executed at the same time.
Multi-core computer is an example of thread level parallelism. Individual threads can be
executed at the same time by individual cores.
1. Performance Metrics
The main role of computers is to perform tasks for people. The performance of a processor is
commonly expressed as the average number of instructions executed in a second.
• An instruction is the smallest recognizable unit of a task.
• The clock rate of a processor is the number of clock cycles per second. In each clock
cycle, the processor can take one step. The clock rate can indicate the work rate of the
processor.
• Due to different instruction set architectures, different processors take different numbers of
clock cycles to execute an instruction.
o For example, a processor running at a faster clock rate is not necessarily the
better performer. The processor may need many more clock cycles to
complete the execution of one instruction.
This performance measurement is usually expressed in the unit of MIPS (millions of
instructions per second), as processors are typically fast enough to execute over several
million instructions per second.
144
Answer:
(i) Computer A takes 3 seconds to execute 300 million instructions. The millions of
instructions per second (MIPS) is 300 million / 3 seconds = 100 MIPS. Computer B takes
5 seconds to execute 100 million instructions. The millions of instructions per second (MIPS)
is 100 million / 5 seconds = 20 MIPS.
(ii) There are a few reasons: (1) Computers A and B support different instruction sets, and
so the same program is compiled into two different sets of machine code. (2) The compilers
are not of the same quality, so one of them might have generated poor and inefficient code.
(3) The clock rates of the 2 computers are different. One of them may be slower.
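The arithmetic in part (i) can be written as a small helper:

```python
def mips(instructions, seconds):
    # Millions of instructions executed per second.
    return instructions / seconds / 1_000_000

computer_a = mips(300_000_000, 3)
computer_b = mips(100_000_000, 5)

print(computer_a)  # 100.0
print(computer_b)  # 20.0
```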
The MIPS measurement has some merits for comparing performance of processors, but it
does not take into account the amount of work actually done by an instruction.
• Complex instructions generally take more clock cycles to complete than simple
instructions.
• For example, ADD (addition) and MUL (multiplication) are two instructions of different
levels of effort.
• Without MUL in an instruction set, a program would need a number of ADD and other
instructions to perform multiplication.
• One cannot compare the performance of the two ways of doing multiplication without
looking into the detailed performance parameters.
                              Computer A    Computer B
Clock Rate                    1.5 GHz       3.0 GHz
MUL Instruction Clock Cycles  35 cycles     Not provided
To perform a multiplication operation on Computer B, the best a programmer can achieve
is the execution of 20 instructions, with an average of 2 cycles per instruction. Which
computer can perform a multiplication operation faster?
Answer:
Time to execute one multiplication operation on Computer A:
One MUL instruction.
Time = 35 cycles / 1.5 × 10^9 cycles per second = 23.3 × 10^-9 seconds ≈ 23.3 ns
Time to execute one multiplication operation on Computer B:
20 instructions with an average of 2 cycles per instruction.
Time = (2 cycles/inst × 20 inst) / 3.0 × 10^9 cycles per second = 13.3 × 10^-9 seconds
≈ 13.3 ns
Computer B can execute a multiplication operation faster, but it depends on the skill of the
programmer to write efficient code.
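The same arithmetic as a sketch, using time = cycles / clock rate:

```python
def op_time_ns(cycles, clock_hz):
    # Time for a multi-cycle operation, converted to nanoseconds.
    return cycles / clock_hz * 1e9

time_a = op_time_ns(35, 1.5e9)      # one 35-cycle MUL instruction
time_b = op_time_ns(2 * 20, 3.0e9)  # 20 instructions, 2 cycles each

print(round(time_a, 1))  # 23.3 ns
print(round(time_b, 1))  # 13.3 ns
```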
145
Performance for Enterprise Computing
Enterprise computing refers to the application of computing technologies for large-scale
business applications. Computing solutions for banks, financial institutions, logistics and
government are often based on enterprise computing technologies. These users are more
concerned with the number of tasks completed, where the tasks are business transactions,
processed orders, and requests handled. The performance measurement is therefore the
number of such tasks completed in a second.
The following are some figures obtained from a test of applying IBM Power 750 with 32
POWER7 cores in a bank (Reference: http://www.ameinfo.com/record-breaking-unmatched-
results-ics-banks-305031):
• 30,000 concurrent users and 14,700 financial transactions per second.
• 51,431 transactions per second in ATM and Internet Banking activities.
• 401,606 interest accounts processed per second.
Benchmarking
Benchmarking is a technique that compares the performance of different computers by
measuring the time each one takes to complete a particular set of programs. Benchmark
programs are specially designed sets of programs for measurement purposes.
• For a particular benchmark, the same workload is given to a set of computers to test their
performance.
• Benchmarking provides a common standard for comparing performance.
• Benchmarking is especially important for comparing computers of different architectures.
o Computers of the same architecture may be compared at the design level:
instructions per cycle, clock rate, etc.
o Computers of different architectures have different instruction sets and are
difficult to compare conceptually.
There are a number of industry-standard benchmarks. These benchmark standards have been
scientifically tested so that the test results are consistent and reproducible. Here are some
examples:
• Standard Performance Evaluation Corporation (SPEC)
• Business Applications Performance Corporation (BAPCo)
Benchmarks are usually specific to a particular workload. Here workload means the type of
computer applications. Typical workloads are Business applications and Graphical
applications.
• The type of instructions executed by a Graphical application is different from that
executed by a Business application.
o A Graphical application typically performs more floating-point arithmetic (for
2D and 3D coordinate calculation).
o A Business application typically performs more integer data movement and
some integer arithmetic.
• A CPU that is efficient on data movement and integer arithmetic will perform better with
business applications. The same CPU will not perform as well with Graphical
applications.
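As an illustration of how workload type affects measured performance, the following toy benchmark (a hypothetical sketch, not an industry benchmark such as SPEC) times an integer-heavy loop and a floating-point-heavy loop on the same machine:

```python
import time

def run_benchmark(workload, n):
    """Time a workload function over n iterations (a toy benchmark)."""
    start = time.perf_counter()
    workload(n)
    return time.perf_counter() - start

def integer_workload(n):
    # Business-style work: integer arithmetic and data movement.
    total = 0
    for i in range(n):
        total += i * 3
    return total

def float_workload(n):
    # Graphics-style work: floating-point coordinate arithmetic.
    x = 0.0
    for i in range(n):
        x += i * 0.5 + 1.25
    return x

for name, fn in [("integer", integer_workload), ("float", float_workload)]:
    elapsed = run_benchmark(fn, 1_000_000)
    print(f"{name} workload: {elapsed:.3f} s")
```

A real benchmark suite would run standardized, much larger workloads many times and report a normalized score rather than raw seconds.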
146
2. Pipeline Architectures and Instruction Pipelines
Pipeline architectures achieve very high performance by executing multiple instructions in
parallel. The time taken for an individual instruction is not reduced. However, the overall
throughput is improved because more instructions are executed per second.
Parallel execution of instructions is difficult to realize. The following shows three
instructions running in sequence.
If the three instructions were to be executed at the same time, then theoretically the following
would happen.
147
Basically, the above model of instruction execution is similar to the Register Transfer
Language (RTL) perspective. The following shows the RTL for the LMC ADD instruction.
Instead of expressing instructions in RTL, we now view them as consisting of four
phases: Fetch, Decode, Execute, and Write.
Some instructions take longer in the Execute phase and others take longer in the Write
phase. If an instruction needs one clock cycle to complete each of the four phases, the
total time required would be four clock cycles. The following figure shows the phases
executing in sequence over time.
The following figure shows that when multiple instructions are executed at the same time,
more than one instruction may be in the same stage. For example, in clock cycle #3, all
three instructions are in the Execute stage, so three execution units such as ALUs may be
required.
148
Instruction Pipelining
An instruction pipeline processes an instruction in several stages. The
output of one stage is passed to the input of the next stage.
The separation into several stages has one major benefit: instructions may be executed at the
same time without the need for more execution units or multiple instances of other
mechanisms.
In instruction pipelining, each instruction is handled at a different step in the instruction
cycle.
• CPU can handle several instructions at the same time but they are all at different stages.
o In the second time cycle below, the CPU is executing the Fetch phase of
instruction #2 and Decode phase of instruction #1.
• Each stage is handled by a dedicated component.
o A Fetch component is handling the Fetch stage of an instruction, and a
Decode component is handling the Decode stage of another instruction.
• The components in an instruction pipeline should operate independently and at the same
time.
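The overlapped execution described above can be sketched as a small simulation, assuming a four-stage pipeline in which every stage takes exactly one clock cycle (a simplification; hazards are ignored here):

```python
STAGES = ["Fetch", "Decode", "Execute", "Write"]

def pipeline_schedule(num_instructions):
    """Return, for each clock cycle, which instruction occupies which stage.

    Instruction i enters Fetch in cycle i and leaves Write in cycle
    i + len(STAGES) - 1 (cycles numbered from 0 internally).
    """
    total_cycles = num_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        active = {}
        for stage_index, stage in enumerate(STAGES):
            instr = cycle - stage_index
            if 0 <= instr < num_instructions:
                active[stage] = instr + 1   # 1-based instruction number
        schedule.append(active)
    return schedule

for cycle, active in enumerate(pipeline_schedule(3), start=1):
    print(f"cycle {cycle}: {active}")
```

In cycle 2 the schedule shows instruction #2 in Fetch while instruction #1 is in Decode, exactly as described above; three instructions complete in six cycles rather than twelve.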
149
Scalar Processing and Super-Scalar Processing
An instruction pipeline can theoretically achieve almost one clock cycle per instruction. If there
is no break in the execution, then the continuous overlapping of instruction execution can get
close to scalar processing.
The ability to execute one instruction in a clock cycle is called scalar processing. A CPU of
this class is thus called a scalar processor.
Superscalar processing is a design that employs more than one execution unit within the
CPU so that multiple instructions can be executed simultaneously. A superscalar processor can
execute more than one instruction per clock cycle on average.
For example, a superscalar processor may contain one fetch unit and two execution units.
• The single fetch unit of the processor can fetch several instructions at a time.
• The fetched instructions are saved in the instruction buffer within the processor before
being fed into the execution units.
• The execution units can then perform the steps of the execution phase of two instructions in
parallel.
150
Exercise: Super-Scalar Performance
A CPU has a Fetch component that can fetch 6 instructions in one clock cycle. The CPU has
two sets of Decode, Execute, and Write-Back components. Draw the instruction execution
status in the instruction pipeline and evaluate the performance.
Answer:
The Fetch component maintains the 6 instructions in the buffer until they are all decoded.
Then the Fetch component can fetch the next 6 instructions.
The above pipeline can achieve super-scalar performance in the long run. For example, the
above shows that 10 instructions can be executed in 9 clock cycles.
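The cycle count claimed above can be checked with a small model. The assumptions below are read directly from the exercise (a whole batch of 6 instructions is fetched in one cycle, the next batch is fetched only after the current batch is fully decoded, two instructions enter Decode per cycle, and Decode, Execute, and Write-Back each take one cycle); this is not a general CPU model:

```python
import math

def finish_cycle(i, fetch_width=6, issue_width=2):
    """Clock cycle (1-based) in which instruction i (0-based) completes
    its Write-Back stage, under the assumptions of the exercise."""
    decode_cycles = math.ceil(fetch_width / issue_width)   # 3 decode cycles per batch
    batch = i // fetch_width
    fetch = 1 + batch * (1 + decode_cycles)                # batch 0 -> cycle 1, batch 1 -> cycle 5
    group = (i % fetch_width) // issue_width               # decode pair within the batch
    decode = fetch + 1 + group
    return decode + 2                                      # + Execute + Write-Back

print(finish_cycle(9))   # the 10th instruction completes in cycle 9
```

Ten instructions in nine cycles gives more than one instruction per cycle, i.e. super-scalar performance.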
151
3. Efficiency and Hazards of Instruction Pipelines
The efficiency of a pipeline depends on a regular pattern of clock cycles across the stages of
an instruction. The efficiency drops if one of the stages takes two clock cycles instead
of one.
Pipelines are at their most efficient when all instructions in the same pipeline have the same
pattern of clock cycles across the stages of execution.
For example, the following figure shows the execution stage of an instruction consuming
more than one clock cycle.
There are general pipelining hazards that affect the performance of an instruction
pipeline.
Data hazards
Data hazards happen when an instruction depends on the result of a previous instruction that is
still in the pipeline.
• For example, the third step of an instruction needs a result that is stored into a register in
step 4 of the previous instruction.
• A common solution to this kind of hazard is to stall the pipeline by inserting one or more
stalls (wait states) into it.
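The stall arithmetic can be sketched in a few lines. This is a simplified model that assumes no operand forwarding and that a result becomes available only at the end of the producer's write stage:

```python
def stalls_needed(write_stage, read_stage, distance):
    """Bubbles required so the consuming instruction reads its operand
    only after the producing instruction has written it.

    Stages are 1-based (1 = Fetch ... 4 = Write); distance is how many
    instructions after the producer the consumer enters the pipeline.
    """
    return max(0, write_stage - read_stage - distance + 1)

# The example from the text: the consumer needs the result in its
# stage 3, but the producer (one instruction earlier) only stores it
# in stage 4, so one bubble must be inserted.
print(stalls_needed(write_stage=4, read_stage=3, distance=1))   # -> 1
```

If the two instructions are further apart (distance 2 or more here), the result is ready in time and no stall is needed.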
152
Control hazards
Control hazards happen when the pipeline has fetched instructions following a branch before
the branch outcome is known. If the branch is taken, the instructions fetched along the wrong
path must be discarded, wasting pipeline cycles.
153
Structural hazards
Structural hazards mean that the hardware cannot support the running of two instructions at the
same time, even if they are in different stages.
• For example, both instructions require access to memory but there is only one memory port for
accessing data.
• As with data hazards, a solution to structural hazards is to stall the pipeline by inserting
one or more bubbles into it.
154
Instruction Set Architecture and Pipeline Efficiency
Generally, an instruction pipeline can be at its most efficient only if all instructions in the
instruction set follow the same pattern, with the same number of clock cycles spent in each of
the Fetch, Decode, Execute, and Write-Back stages.
Instruction pipelines work more effectively with a RISC type of instruction set.
• Most instructions follow the same pattern of execution.
• Many instructions are register-based, so they cause fewer structural hazards due to the
memory system bottleneck.
155
Modern Superscalar CPU
The processing power of modern CPUs depends partly on instruction pipelines with
multiple execution units.
• Different types of execution units are tailored to the needs of different types of
instructions.
• A complex steering system sends instructions to the various execution units.
• An algorithm manages operands and retires instructions in correct program order.
• Instructions can be processed out of program order to keep superscalar processing
effective.
The following figure, adopted from Englander Page 221, illustrates the major components of
a modern CPU design.
A CPU equipped with multiple parallel execution units allows instructions with the same
pattern of execution to be put into the same pipeline.
The CPU instruction decoder distributes instructions to a number of parallel execution units.
Each execution unit is optimized to perform one type of instruction.
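The steering idea can be sketched as a simple dispatch table; the unit names and instruction types below are illustrative, not taken from any particular CPU:

```python
# The decoder classifies each instruction and routes it to an execution
# unit optimised for that type of work.
EXECUTION_UNITS = {
    "integer": "Integer ALU",
    "float":   "Floating-Point Unit",
    "load":    "Load/Store Unit",
    "store":   "Load/Store Unit",
    "branch":  "Branch Unit",
}

def dispatch(instructions):
    """Map each (opcode, type) pair to the execution unit that runs it."""
    return [(op, EXECUTION_UNITS[kind]) for op, kind in instructions]

program = [("ADD", "integer"), ("FMUL", "float"), ("LOAD", "load"), ("BEQ", "branch")]
for op, unit in dispatch(program):
    print(f"{op:5s} -> {unit}")
```

In a real superscalar CPU the steering logic also tracks operand availability and retirement order, which this sketch omits.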
156
4. CPU Implementation Approaches
Hardwired implementation approach
The hardwired implementation approach designs dedicated hardware logic and circuitry for
each instruction. All of this hardware logic and circuitry is then embedded into a single chip.
• Each instruction has its own hardwired logic path to follow when being executed in that
CPU.
• The hardware logic circuits are combined together to form the control unit of the
CPU.
• The control unit controls the state of the instruction cycle with the help of a timing signal
generator. At the end of each stage in the execution cycle, the control unit issues signals
to tell the timing signal generator to initiate the next stage.
Advantage: This approach is straightforward to implement and works well for simple CPU
architectures.
Disadvantage: This approach is not flexible to change. Consider what needs to be done when
you want to upgrade the CPU by adding several new instructions and modifying a few of the
existing instructions.
157
Microprogramming implementation approach
The microprogramming approach is based on the observation that, no matter how
complex an instruction is, it can be broken down into a series of fundamental operations within
the CPU.
• Data movement: moving data from one register to another.
• Arithmetic and logic functions: performing simple arithmetic or logic functions on data in
registers.
• Conditional branches: making simple decisions based on the values stored in flags and
registers.
Rather than building separate hardware logic for each and every instruction, a number of
simple hardware logic units are built for internal CPU operations, and these internal CPU
operations are then used to form the instructions of the CPU.
The fundamental CPU operations are called microinstructions. These microinstructions
are then programmed to form the actual instructions of the CPU. The tiny programs that form
the CPU instruction set are called microcode. The CPU has built-in read-only memory to
store the microcode.
158
The following shows an example of executing the ADD instruction. The control unit
executes the microinstructions according to the sequencing logic in the microcode library.
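The microprogramming idea can be sketched as a toy interpreter. The register names and micro-operations below are illustrative (loosely modelled on the LMC, not any real microarchitecture); the ADD instruction is defined not as dedicated hardware but as a list of microinstructions stored in a microcode library:

```python
class MicroCPU:
    def __init__(self, memory):
        self.mem = memory
        self.reg = {"PC": 0, "MAR": 0, "MDR": 0, "IR": 0, "A": 0}

    # Fundamental operations the hardware actually implements:
    def move(self, dst, src):        self.reg[dst] = self.reg[src]
    def read_mem(self):              self.reg["MDR"] = self.mem[self.reg["MAR"]]
    def add_mdr_to_a(self):          self.reg["A"] += self.reg["MDR"]
    def incr(self, r):               self.reg[r] += 1
    def load_operand_address(self):  self.reg["MAR"] = self.reg["IR"] % 100

    def run_microcode(self, microcode):
        for micro_op in microcode:
            micro_op(self)

# Microcode for ADD: a fetch phase followed by an execute phase.
ADD_MICROCODE = [
    lambda c: c.move("MAR", "PC"),       # PC -> MAR
    MicroCPU.read_mem,                   # M[MAR] -> MDR
    lambda c: c.move("IR", "MDR"),       # MDR -> IR
    lambda c: c.incr("PC"),              # PC + 1 -> PC
    MicroCPU.load_operand_address,       # IR[address] -> MAR
    MicroCPU.read_mem,                   # M[MAR] -> MDR
    MicroCPU.add_mdr_to_a,               # A + MDR -> A
]

# LMC-style encoding: opcode 1xx means ADD from address xx.
memory = {0: 105, 5: 42}                 # ADD from address 5; M[5] = 42
cpu = MicroCPU(memory)
cpu.reg["A"] = 8
cpu.run_microcode(ADD_MICROCODE)
print(cpu.reg["A"])                      # 8 + 42 = 50
```

Upgrading such a CPU with a new instruction means writing a new microcode sequence rather than redesigning hardware logic, which is the flexibility advantage over the hardwired approach.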
159
5. Multi-Core Processors
Multi-core processors are processors that contain multiple processing units (cores) in a single
physical chip package. Each core is an independent processing unit with its own cache
memory providing instructions and data. One core can execute one program or thread,
so multiple cores can execute multiple programs or threads at the same time. Multi-core
processors are an example of thread-level parallelism.
The following figure (left) shows the architecture of a standalone computer. The figure
(right) shows two computers networked together. The following lists the characteristics of
this configuration.
• Two networked computers can run two programs at the same time, so throughput is
increased.
• There are two main memory systems. However, data exchange between the two relies on
the network (probably a local area network). The data transfer rate is not fast, limiting the
potential for cooperation.
The following figure (left) shows a typical configuration of a multi-core computer system.
The multi-core processor below has four cores (commonly called quad-core). Each core has
its own cache memory and a connection to the single main memory system.
However, the connection is shared between the cores. This is called the shared memory model.
160
The following lists the characteristics of this configuration:
• There is a single bus connecting the cores and the main memory, allowing a good data
transfer capacity between them.
• Each core is expected to execute individual threads or programs, whose instructions
are usually located in the local cache memory.
• When a core needs to load data from the main memory, the performance
bottleneck of the shared memory model comes into play: only one core can access the
main memory at a time.
Multi-level cache memory is often used to reduce the frequency of main memory
accesses. The Intel Core i7, for example, has three levels of cache; the Level-3 cache is shared
between the cores.
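The thread-level parallelism described above can be sketched with a minimal example: two independent tasks are submitted as threads, each of which a multi-core processor can schedule on its own core. (Note that in CPython the global interpreter lock limits parallel execution of Python bytecode, so this illustrates the programming model rather than a raw speedup.)

```python
from concurrent.futures import ThreadPoolExecutor

def task(name, n):
    """An independent unit of work, standing in for one program/thread."""
    total = sum(range(n))
    return f"{name}: {total}"

# Two worker threads, mirroring two cores each running its own thread.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(task, "core-0", 10),
               pool.submit(task, "core-1", 20)]
    for f in futures:
        print(f.result())
```

Both tasks touch only their own data, so they never contend for shared memory; the shared-memory bottleneck appears only when threads must exchange data through main memory.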
161
6. Enterprise and Mainframe Computing
(This section is adapted from the IBM Academic Initiative course on mainframe, and it is
used with permission)
Enterprise computing is the style of computing that satisfies the information processing needs
of large enterprises:
• Very large amounts of data (e.g. transaction data in a stock exchange)
• High availability (i.e. almost never breaks down)
• Integrity and security (i.e. the data are guaranteed to be safe and correct)
• Scalability (i.e. the system capacity can increase gracefully)
A mainframe is what businesses use to host their commercial databases, transaction servers,
and applications that require a greater degree of security and availability than is commonly
found on smaller-scale machines.
Strengths of Mainframes
• Reliability: Hardware provides self-checking and the ability to recover from errors.
Software is extensively checked and tested.
• Availability: Usually measured in Mean Time Between Failures (MTBF), which may be
months or years for a modern mainframe. Able to continue operating while dealing with
errors or scheduled upgrades.
• Serviceability: Provides information about the source of a failure and allows a rapid problem fix.
• Security: Provides a framework to manage authentication and prevent unauthorized access.
• Scalability: Provides the flexibility to change capacity with minimal impact on operation and
cost.
• Continuing Compatibility: Enterprises typically invest a lot of money in application
development on mainframes, and it is important that such applications continue to function
even after decades.
162
Mainframes in the Modern World
Mainframe computers are usually hidden from public view. However, they are the driving
force behind many essential day-to-day activities.
• Many of the Fortune 1000 companies use a mainframe system.
• Over 60% of all data available on the Internet is stored on mainframe systems.
• There are at least 10,000 mainframe systems still running in the world.
• Most banks in Hong Kong are supported by mainframes.
The yearly revenue generated from mainframe computing is still between 4 and 6 billion US
dollars. Between 2,000 and 3,000 mainframe systems are shipped every year.
IBM is the largest mainframe vendor and probably the only large vendor still in the market.
IBM has continuously enhanced mainframe computing with the most current technologies.
The modern mainframe computer is no longer a room-sized computer system. It now
includes distributed computing, cloud computing, and virtualization in its armoury.
The current IBM mainframe systems are the System z series. The following lists the
core features of the IBM zEnterprise System introduced in 2010:
• The z196 processor chip is a quad-core 5.2 GHz CISC processor.
• The z196 system can support a maximum of 24 processors.
• Each core may be assigned a specific role such as a typical Central Processor or an
Application Assist Processor for running Java/XML.
• Maximum memory is 3TB.
Mainframes Architectures
163
The specifics of the architectural features are usually worked out rigorously from a Service
Level Agreement (SLA).
• An SLA is an agreement between a service provider and a recipient about the level of
performance required.
o For example, a bank may want 99% of ATM transactions to be completed within
one second.
• The number of processors, IO bandwidth, memory, etc. are worked out from the required
performance level.
The following figure shows the conceptual structure of a traditional mainframe.
• The Central Processor contains processors, main memory, and other control circuitry.
• Large-capacity data processing is enabled through a large number of channels, each of
which connects IO devices (such as hard disks) to the memory storage.
• Processing capacity is scaled up by connecting IO devices to more Central
Processors. The Control Units manage the paths of data movement between IO devices,
channels, and other Control Units.
• More Central Processors can be connected when the processing capacity requirement
increases.
The newer mainframe computers have more advanced features in IO connectivity and
configuration and system partition.
164
Mainframes Management
The physical appearance of a mainframe computer is based on frames. Frames are where
mainframe components called cages are fixed. There are two types of cages:
• One Central Electronic Complex (CEC) cage, which contains the processor units (PU), the
physical memory, and connectors to the other cages.
o The CEC cage contains one to four books, in which processors and memory are
put together as a physical unit.
• One to three IO cages, each of which contains connections to external IO devices.
The hardware configuration, system images, etc. are managed through a hardware
management console.
Processors
Mainframes are multi-processor systems. Each processor may be given a specific role and
perform specific work.
• Central Processor (CP): to support normal operating system and application execution.
• System Assistance Processor (SAP): provides a high-reliability, high-availability IO
subsystem. It manages multiple paths to control units and performs error recovery.
• Integrated Facility for Linux (IFL): a Central Processor that cannot run z/OS, the
operating system provided by IBM for System z. IBM charges less for this type
of processor.
• Integrated Coupling Facility (ICF): couples together several z/OS-based systems to
form a collaborative system.
• Spare: for use when another CP fails, or to support Capacity Upgrade on
Demand (CUoD), such as a sudden increase in processing need.
Direct Access Storage Devices (DASD) are advanced versions of the typical hard disks used in
personal computers.
• DASDs are usually housed physically in a different location from the processors.
• A DASD has multiple disks arranged in a sophisticated manner for higher throughput and
higher reliability.
165
Virtualization and Partitioning
166