
The Open University of Hong Kong

School of Science and Technology


Computing Programmes

Lecture Notes

COMPS266F
Computer Architecture

2017 Presentation
Copyright © Andrew Kwok-Fai Lui 2017

Chapter 1. Introduction to Programmable Computers

The computer architecture part of the course begins with the problem of how to design a
programmable computer. We start by explaining what a programmable computer is.

1. Computer Systems and Computing Process

Computers, or computer systems, do not necessarily refer to the machine you have on
your desktop. Computers are devices that can calculate.
The word computing originally meant calculating. As with many other English words, the
meaning of computing has changed with the development of society and technology.
• In the modern era, computing may refer to anything from the day-to-day operations of
financial institutions to creating documents and spreadsheets, programming, data mining,
statistical analysis, and controlling a spaceship. Modern computers are electronic devices
that you can purchase from a computer shop. They allow you to play computer games,
write documents, carry out financial planning, and talk to friends across the world.
• In the old days, computers were often mechanical devices. The abacus is an example of a
computer that was great for simple arithmetic. Wilhelm Schickard invented a digital
mechanical calculator in 1623 that used metal gears and levers; he is known as the father
of the computing era. Mechanical computers remained common until the 1930s and
1940s, when electronic devices began to be used to build computing systems. The
Electronic Numerical Integrator and Computer (ENIAC), one of the first electronic
computers, was designed for ballistic calculation using over 17,000 vacuum tubes.
ENIAC was built specially for this purpose, but it could be partially re-purposed by
rewiring.

References:
http://en.wikipedia.org/wiki/History_of_computer
About the history of the development of computer systems. Read the story of how the
British invented the first modern computer but were forbidden to reveal it because it was
a wartime secret.
The following shows a typical computing process. The process reads input data, processes
the data, and then writes output data. The process has access to a data storage that can be
used to store data for future use. The stored data can feed back into and influence the process.

Two examples of computing process:


• A census of Hong Kong collects a lot of data about its citizens. The process converts
the raw census data into more informative statistics such as mean family size and income.
• An online bookstore receives a purchase order from a customer. A process handles the
purchase order by storing the current order, and uses previous orders to recommend
other books to the customer.

2. Programmable computers

A programmable computer is one which can be re-purposed.


• Such a computer can perform new tasks with programming. Programming is the act of
writing computer programs, which means putting together instructions in a purposeful
way. A programmable computer executes these instructions one at a time. By
composing programs from different instructions, a programmable computer can
carry out different computing processes.
The programmability of a system is the degree to which the system can be re-programmed.
• A high programmability system can be programmed to serve a wider range of purposes.
A modern general desktop computer system has high programmability. It can be used for
word processing, video editing, watching television, playing games, and many others.
• On the other hand, a DVD recorder is programmable but has low programmability. It can
be used to record video using different modes, at different times, in different channels.
However, it cannot be re-purposed to do other things.

The following gives examples of programmable systems from low programmability to high
programmability.
• A toaster with a time knob
• A washing machine with programs for various types of clothing.
• A DVD recorder supporting various recording modes.
• A programmable calculator supporting programmed sequences of calculation steps.
• An Excel spreadsheet supporting functions and macros.
• A modern general purpose computer system

The following lists some ways of achieving high programmability.


• A large number of instructions available for composing programs.
• High flexibility to compose and sequence the instructions in many different ways.
• Little time and effort needed to re-program the system.

3. Components of a Programmable Computer

We will identify the components of a programmable computer.


We solve this problem by referring to the computing process model and then listing out the
major items relevant to a programmable computer:
• programs: contain instructions
• instructions: which are commands executable by the computer
• data: to be processed by the computer

These items will perform some processes. The major processes are listed below:
• instruction execution: an essential function of the programmable computer
• data storage: a function for storing the data before and after the instruction execution
• program storage: a function for storing the program in the programmable computer for
instruction execution
• inputting data and program: a function for data and program to go into the computer from
the outside world
• outputting data: a function for data to leave the computer to the outside world
The last two processes are essential because a computer cannot exist in isolation. A computer
useful for any purpose must be able to interact with the outside world.
These processes are refined and their roles are abstracted into the following major
components for a programmable computer.
• Arithmetic and Logic Unit (ALU): for instruction execution
• Memory system: for data and program storage
• Input: for data input into the programmable computer
• Output: for data output from the programmable computer

4. Introduction to Arithmetic and Logic Unit

The first component of a programmable computer is the Arithmetic and Logic Unit (ALU).
The ALU is a functional unit responsible for the execution of instructions.
• The execution of instructions is an essential function of a programmable computer.
• The input to the ALU includes the data and the instructions that command how to deal
with the data.
• The result of the instruction execution will appear at the output of the ALU.
The following figure shows a schematic diagram of the ALU with its input and output. One
input channel is for sending in instructions and the others are for sending in data. The ALU
can typically execute many types of instructions, for example add, subtract, and negate.

In the figure, the ALU has two data input channels and one data output channel. This is a
typical arrangement because most operations (instructions) have at most two operands.
• Addition: A + B. A and B are passed into the two input channels and the result of A+B
will appear at the output channel.
• Subtraction: A – B. The same case as addition.
• Negation: -A. This is a single operand operation. A is passed into one input channel.
The operation of the ALU is controlled by the instruction. For example, an Addition
instruction makes the ALU perform an addition operation on the input data. The ALU
also outputs data to inform other components of its status. For example, if an error occurs in
the calculation, an error status may be emitted.
Different instructions are represented by different electronic signals, which in turn
represent numbers. The ALU designer may specify that 01 represents Addition and 02
represents Subtraction. The coding of instructions is usually published in a technical
manual.
The ALU does not carry out any operation unless it is told to do so. The clock line connected
to the ALU sends a regular signal to the ALU, similar to an alarm's "beep beep beep"
sound. Upon receiving a beep, the ALU executes one instruction. It then executes
another instruction when the next beep arrives. A minimal sketch of this dispatch follows.
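To make the instruction coding concrete, here is a minimal software sketch of a two-input
ALU in Python. The opcode values (01 for Addition, 02 for Subtraction, 03 for Negation)
follow the hypothetical encoding above; a real ALU implements this dispatch in circuitry,
not in software.

    # A minimal software sketch of a two-input ALU.
    # Opcodes follow the hypothetical coding in the text:
    # 01 = Addition, 02 = Subtraction, 03 = Negation (single operand).

    def alu(opcode, a, b=None):
        """Execute one instruction; return (result, error_status)."""
        if opcode == 1:        # Addition: A + B
            return a + b, False
        elif opcode == 2:      # Subtraction: A - B
            return a - b, False
        elif opcode == 3:      # Negation: -A (B is unused)
            return -a, False
        return None, True      # unknown instruction: emit error status

    print(alu(1, 7, 5))   # (12, False), one instruction per clock "beep"
    print(alu(3, 7))      # (-7, False)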

5. Memory
The second component is Memory. In a programmable computer, memory is a component
supporting the following functions:
• Store data (Write)
• Retrieve data (Read)
• Overwrite previously stored data (Overwrite)

The left figure shows a schematic diagram of Memory with its input and output. There is one
data channel.
• Because one can write data to the memory as well as read data from it, the data
channel works in both directions. This communication pattern is known as duplex.
• The read/write line controls whether the memory Reads or Writes data through
the data channel.
• Similar to the ALU, Memory carries out actions as it receives signals from the clock line.
At each beep of the clock line, memory performs one data operation, whether it is a read
operation or a write operation.
• Each data operation involves a data unit. The size of data units varies from one memory
system to another.

A useful Memory should be able to store many data units. If more than one data
unit is stored in Memory, there must be a way to identify each one.
• There is a unique address associated with each data unit stored in Memory.
• An address is usually a numeric value numbered sequentially from 0.
• The first data unit in Memory has address 0, the second unit has address 1, and so on.
• The number of addresses is equal to the overall size of Memory.
• The address line is used to specify an address for the current operation.
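As an illustration of these functions, the following is a minimal Python sketch of an
addressable memory with a read/write control line and sequential addresses (the class and
method names are ours, not part of any real hardware interface).

    # A minimal sketch of an addressable memory.
    # Addresses run sequentially from 0 to size - 1.

    class Memory:
        def __init__(self, size):
            self.cells = [0] * size        # one data unit per address

        def access(self, read_write, address, data=None):
            """Perform one data operation (one per clock beep).
            read_write: 'R' to read, 'W' to write (the read/write line)."""
            if read_write == 'W':
                self.cells[address] = data # store, or overwrite
                return None
            return self.cells[address]     # retrieve

    mem = Memory(64)
    mem.access('W', 0, 42)                 # write 42 to address 0
    print(mem.access('R', 0))              # read it back: 42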

In the discussion so far, we have not explained whether a piece of data is a number, a
word, or something else. This is a data representation problem, and the topic will be
discussed later.
Example: Memory operations and the clock rate
Question: Suppose the clock signal to the Memory occurs 2000 times per second, and the
Memory has 64 addresses of data units of storage. Calculate the amount of time required to
write data once to all the data units.
Answer: Each write operation requires 1/2000 seconds. 64 write operations are
required to write data to all the data units. The total time required is (1/2000) * 64 = 0.032 seconds.

6. Input and Output

The final components are Input and Output. These two components connect the
programmable computer and the outside world. The following shows a schematic diagram of
the two components.

There is one important point about Input and Output: the situation at these two components
is not under the computer's control. Input and Output are connected to the outside world,
which is beyond the realm of the programmable computer.
• The designers of the programmable computer can control how the components inside the
computer work. In particular, through the Clock line, the operation timing of
the Memory and the ALU can be controlled precisely.
• However, the Input may receive data at any time, irrespective of the inner workings of the
programmable computer. If data enters the Input while the programmable
computer is not ready to receive it, the data would be lost.
To solve the problem of potential data loss, a data buffer is added to the Input component. It
stores any input data temporarily until the programmable computer is ready to handle it.
The programmable computer may generate a lot of data and write it to the Output. It is up to
the outside world to capture the data; it is not the responsibility of the programmable
computer to ensure that the outside world sees all of it.
Sometimes a buffer is also added to the Output for higher efficiency. For example, if the
Output is connected to a remote web server, it is usually more efficient to send data in larger
batches. A sketch of the input buffer idea follows.
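As a sketch of the buffering idea, an input buffer behaves like a first-in, first-out queue:
data arriving from the outside world is appended, and the computer consumes it when ready.
The fixed capacity below is an assumption for illustration only.

    from collections import deque

    CAPACITY = 8                    # assumed buffer capacity, for illustration
    buffer = deque()

    def on_input_arrival(data):
        """Called whenever the outside world delivers data."""
        if len(buffer) >= CAPACITY:
            print("buffer full, data lost:", data)   # the risk described above
        else:
            buffer.append(data)     # hold until the computer is ready

    def computer_reads():
        """Called when the programmable computer is ready for input."""
        return buffer.popleft() if buffer else None

    on_input_arrival("key press")
    print(computer_reads())         # -> key press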

Example: Size of Input Buffer


Question: What is the ideal size of the data buffer of the Input component?
Answer: The ideal size would be an unlimited data buffer. However, this does not exist in
the real world.

7. Summary

In this chapter we started solving the problem of "designing a programmable computer". So
far we have resolved the first sub-problem: "List the major components of a programmable
computer".
In the next chapters, we will resolve the remaining sub-problems:
• The design of Arithmetic and Logic Unit (ALU)
• The execution of arithmetic operations by the ALU
• The data representation in the ALU operations
• An effective method of feeding input data to the ALU and handling output data from the
ALU.
• A method to control the ALU to perform different operations.
• A method to control the ALU to perform multiple operations sequentially.
• An effective method of feeding both instructions and data to the ALU
• Enabling the programmable computer to carry out input and output operations.


Chapter 2. Arithmetic and Logic Unit Design: Processing Integers

This chapter focuses on how to design an Arithmetic and Logic Unit for integer addition
and subtraction: the simplest ALU that can handle these two instructions.

1. Overview

The Arithmetic and Logic Unit (ALU) is a component that can execute arithmetic and logic
instructions. The most basic ALUs execute only a few of the simplest arithmetic and logic
instructions. More sophisticated ALUs can execute very complex operations, and they
may contain a number of simpler ALUs within them.
We will begin with the simplest type of ALU and consider first the problem of designing an
ALU that supports addition and subtraction of integers.

2. Requirements of a Good ALU Design

The functionality of the ALU includes addition and subtraction of integers. However, we do
not want just any ALU; we want an ALU that is useful for our programmable computer. There are
additional desirable requirements of a good ALU design.
• High Reliability and Error Free. The ALU operations should be highly reliable. It is
useless to have an ALU that sometimes churns out an erroneous result.
• Simple Design. The ALU design should be simple, to reduce the cost of producing
it and the cost of making improvements.
• Efficient Operation. The ALU should operate efficiently, so that it improves the
performance of our programmable computer.
• Large Data Range. The ALU should be able to handle as large a data range as possible.
One always desires a calculator that can handle more digits.
• Economical Cost. The ALU should not be too costly to produce. Complexity and
power often come with increased cost.
The above issues are all important to our design considerations.
However, one of the most important lessons of this course is that in the real world, not all the
above issues carry the same weight. We cannot have everything; we must be
selective.
• We have to sacrifice some desirable characteristics in order to achieve other
desirable characteristics.
• We must be prepared to give and take: give up the less important desires and
keep the most important ones. This is known as a trade-off.

References: Trade-Off
http://en.wikipedia.org/wiki/Trade_off

3. The Main Approaches of our ALU Design

The ALU we are going to design will execute instructions and process data. We are
interested in the details of the operations, for example how addition or subtraction can occur.
• The data or numbers would not be in a written form. The addition would not be done
by pencil and paper.
• The programmable computer is an electronic device and the ALU is built using
electrical circuitry.
• Data and numbers will be coded as electrical signals and arithmetic operations will be
carried out electronically.
Our basic ALU design consists of the following main features:
• The ALU and the programmable computer will use digital representation to code
data in electrical signals.
• The ALU uses the binary numeral system to code numbers for arithmetic and logic
operations.
• The ALU uses two's complement binary representation for addition and
subtraction of positive and negative numbers. Two's complement binary
representation is a variant of the binary numeral system.
The above three main features help to justify that our ALU is the result of good design
decisions. We will explain these features in the following sections.

4. Digital Representation

The ALU will use digital representation to encode data. Data is an abstract entity but
eventually it must be represented somehow with a physical attribute. In electronic systems,
the common physical attribute to use is the voltage. Voltage is a continuous scalar.
There are actually two fundamental ways to represent data with voltages: analogue and
digital. Our ALU will use digital representation for greater reliability and error tolerance.
The following figure explains the analogue representation.
• Analogue representation is continuous. Any small changes in the signal can change
the original value into an incorrect value.
• Analogue representation can represent continuous data values, but it is not tolerant to
noise and other forms of signal degradation.

Digital representation represents data in discrete levels.

The following figure shows that a 2-level digital representation is a lot more error tolerant
than a 10-level digital representation.
• The levels are well defined, and any sufficiently small fluctuation in the signal
keeps the signal at the same level. Digital representation is therefore less prone to
errors.
• One can decide the number of levels used in a digital representation. The error
tolerance decreases as more levels are defined.

Example: Digital Representation


Question: An electronic system can support voltages from 0V to 5V. If we design a system
that requires the representation of the integers from 0 to 10, how should we assign voltage
levels to the values?
Answer: There are 11 integers from 0 to 10, so we need 11 distinct levels. We can space
them 0.5V apart across the 0V to 5V range: 0 maps to 0V, 1 to 0.5V, 2 to 1V, and so on,
up to 10 at 5V.
5. Binary and Other Numeral Systems

The ALU will use the binary numeral system. The binary numeral system uses two symbols,
'0' and '1', to represent data. Its implementation therefore requires only a 2-level digital
representation, which is the most error tolerant and the least technically challenging. Normally a low
voltage represents '0' and a high voltage represents '1', but it could be the other way round. An
even more reliable method is to encode '0' as a change of voltage from low to high, and '1' as
a change from high to low.
A numeral system provides a systematic and consistent set of rules for representing numbers.
Commonly known numeral systems include decimal, binary, and hexadecimal. Key
characteristics of numeral systems include the following:
• Each numeral system defines a set of numbers, such as the integers or the positive numbers.
• Each numeral system gives each number in the set a unique
representation.
• Each numeral system consists of a set of unique symbols, each representing a certain
value. In the decimal numeral system, the ten symbols are 0, 1,
2, ..., 9.
• Each numeral system provides rules for combining symbols to represent a larger
range of numbers, and therefore it can support a large number set.
For example, the decimal numeral system uses positional notation to combine the
symbols to represent numbers such as 32 and 1589. Larger numbers are constructed by
putting symbols together in juxtaposition.
The base of a numeral system is the number of unique symbols used in the system.

Base   Numeral System

2      Binary
3      Ternary
8      Octal
10     Decimal
12     Duodecimal
16     Hexadecimal
20     Vigesimal
60     Sexagesimal

The decimal number system is the norm in today's societies, as it was in ancient China and the
Hindu-Arabic world. In the ancient world, however, there were all sorts of number systems.
• Vigesimal, or base-20, used by the Mayans.
• Duodecimal, or base-12, used in Nigeria.
• Sexagesimal, or base-60, used by the Babylonians.
The decimal number system has 10 symbols, 0 to 9. To represent values larger than 9,
we use positional notation. Positional notation is based on the idea that each digit is
related to the next by a multiplier, which is the base or radix of the number system. In
the decimal number system the multiplier is 10, meaning that a digit one position to the
left is worth 10 times as much.

Example: Positional Notation in Decimal System
Question: Why does the number 3456 represent the value 3456 in the decimal numeral
system?
Answer:

Digits           3              4             5            6
Representation   10^3 or 1000   10^2 or 100   10^1 or 10   10^0 or 1
Value of digit   3000           400           50           6

The total value is the sum of the contributions of all the digits:

3000 + 400 + 50 + 6 = 3456

Example: Positional Notation in Octal System


Question: The octal system has 8 symbols, 0 to 7. In the same way, we can use positional
notation to represent larger values in octal. What is the value of the octal number 2476?
Answer:

Digits                        2            4           7          6
Representation                8^3 or 512   8^2 or 64   8^1 or 8   8^0 or 1
Value of digit (in decimal)   1024         256         56         6

The total value is the sum of the contributions of all the digits:

1024 + 256 + 56 + 6 = 1342 (decimal)
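The positional rule in these two examples generalizes to any base. The short Python sketch
below sums digit x base^position for an arbitrary base; it is a hand-rolled illustration
(Python's built-in int(s, base) does the same job directly).

    # Convert a digit string in any base (up to 16) to its decimal value
    # by summing (intrinsic value of symbol) * base^(position index).

    def to_decimal(digits, base):
        value = 0
        for position, symbol in enumerate(reversed(digits)):
            value += int(symbol, base) * base ** position
        return value

    print(to_decimal("3456", 10))   # 3456
    print(to_decimal("2476", 8))    # 1342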

The number of symbols used in a numeral system is determined by the base.


• The base-2 or binary numeral system has 2 symbols 0 and 1.
• The base-16 or hexadecimal numeral system has 16 symbols, from 0, 1, 2, to 9, A, B,
C, D, E, and F.
• A single digit binary number can represent the values from 0 to 1 (decimal) only
• A single digit hexadecimal number can represent the values from 0 to 15 (decimal).
Comparatively, hexadecimal numeral system needs to support more symbols, but it can also
represent a greater range of values with the same number of digits.
The following table summarizes the characteristics of several common numeral systems.

Base Numeral System No of Symbols Range Represented by 8 Digits Range in Decimal


2 Binary 2 00000000 to 11111111 0 to 255
3 Ternary 3 00000000 to 22222222 0 to 6560
8 Octal 8 00000000 to 77777777 0 to 16777215
10 Decimal 10 00000000 to 99999999 0 to 99999999
16 Hexadecimal 16 00000000 to FFFFFFFF 0 to 4294967295

For the same number of digits, a numeral system with more symbols can represent a larger range
of values. The hexadecimal number FFFFFFFF (hex) is equivalent to the large decimal number
4,294,967,295 (decimal), whereas the largest 8-digit decimal number is only 99999999
(decimal).
However, every symbol must eventually be implemented as a distinct level in the digital
representation, and more levels means poorer error tolerance.

We can also increase the range of representation by allowing a greater number of digits.

Number of digits   Range           Maximum (Binary)                   Maximum (Decimal)
1                  0 to 2^1 - 1    1                                  1
8                  0 to 2^8 - 1    11111111                           255
16                 0 to 2^16 - 1   1111111111111111                   65535
32                 0 to 2^32 - 1   11111111111111111111111111111111   4294967295

So far the numeral system can represent only positive numbers (and zero). We must modify the
representation rules if we want to represent negative numbers. There are two possible
methods.
1. Adding a new symbol
• Adopt an additional symbol '-' to indicate that the number following is negative.
The number -3456 means negative 3456.
• There are now 11 symbols instead of the original 10 in the decimal system.
The drawback of the additional symbol is added overall complexity; for example,
each additional symbol needs a unique level of digital representation.
2. Using a designated digit
• Designate the left-most position to indicate negativity. If the left-most digit is '0',
the number is positive; if it is '1', the number is negative. So 03456 is positive 3456
and 13456 is negative 3456.
• The drawback of this approach is that an additional digit must be added to every number.

Reading: Conversion between Numeral Systems


Appendix 2A.
It covers how to convert numbers between different numeral systems. It also
discusses how octal and hexadecimal can be viewed as convenient forms of binary
representation.

Our ALU will use the binary numeral system because it requires only a 2-level digital
representation, which is the most error tolerant.
• The basic binary numeral system uses two symbols 0 and 1 and the positional
notation to represent positive values.
• The number of digits determines the range of values that can be represented. A digit
in a binary number is called a bit.
• There are 8 bits in a byte.
• An 8-bit binary number can represent 256 different values.
• If the smallest value is 0, then the range is from 0 to 255.

6. Two's Complement Binary Representation for Efficient ALU Operations

The ALU uses a special type of binary representation called the two's complement binary
representation. Even after we have decided on the binary numeral system, there are still
several ways to represent values with binary numbers. To justify our decision that
two's complement binary representation is the most suitable one, we should first review the
metrics for suitability.
• Hardware reliability. All binary representations require 2-level digital representation,
so their hardware implementations would be equally reliable. This factor is therefore not
considered when comparing binary representations.
• Range of representation. A representation useful for many purposes should support
both positive and negative values.
• Utilization of resources. Given the same number of digits and symbols, each
representation can represent a certain range of values. For example, an 8-bit binary
system (two symbols) has 256 unique combinations. A representation that fully
utilizes the resource assigns all 256 combinations to 256 distinct values.
• Efficiency in executing instructions. The ALU will carry out arithmetic and logic
operations. It may be very complex to implement the circuitry for operations based
on a certain representation. For example, addition with binary numbers is simpler
than addition with decimal numbers, because there are fewer possible pairs of
operand digits.
The two's complement binary representation is chosen because it has the following
advantages:
• It can represent both positive and negative values (and also zero).
• It fully utilizes resources. An 8-bit 2's complement number maps to 256
distinct values (from -128 to +127).
• It supports efficient addition and subtraction. Subtraction with 2's complement
numbers is a 2-step process involving a simple bit reversal and an addition. The
circuitry for the addition operation can thus be reused for subtraction,
simplifying the hardware design.
In the following sections we examine variants of binary representations and explain the
reasons for choosing the 2's complement binary representation.

Positive binary representation
Positive binary representation has its smallest value set at zero. The following table shows
the range afforded by various bit sizes of positive binary representation.

Bit size   Range           Maximum (Binary)                   Maximum (Decimal)   Number of Unique Values
1-bit      0 to 2^1 - 1    1                                  1                   2
8-bit      0 to 2^8 - 1    11111111                           255                 256
16-bit     0 to 2^16 - 1   1111111111111111                   65535               65536
32-bit     0 to 2^32 - 1   11111111111111111111111111111111   4294967295          4294967296

The addition operation is relatively straightforward in this representation. Consider the
addition of two positive binary numbers A and B. Starting from the least significant digits of
A and B, add each pair of digits. The answer can easily be looked up from a (truth) table.
Carries need to be handled between digit pairs, but this is not difficult to implement.
The subtraction operation also starts from the least significant digit and works from right to
left. Borrowing needs to be handled between digit pairs, which again is not difficult.
However, the addition operation and the subtraction operation need two different sets of
circuitry.
There is always the possibility of overflow in addition and subtraction operations.
Overflow happens when the result is above the range afforded by the positive binary
number. In the following 8-bit positive binary addition, the result
exceeds the maximum of 1111 1111.
Overflow: 1100 0000 + 1100 0000 = 1 1000 0000 (only 8 digits are stored)
Overflow can also happen when the result is below the range afforded by the positive binary
number.
Overflow: 0100 0000 - 1000 0000 = less than 0
Overflow will inevitably happen, so the ALU must detect it. For addition, a
carry out of the most significant pair indicates overflow. For subtraction, a
borrow in the most significant pair indicates overflow.

Binary Coded Decimal (BCD)
Binary coded decimal is a representation that codes each decimal digit independently
into binary. For example, the decimal number 68 (decimal) is coded in BCD as
follows. Each decimal digit requires 4 binary digits to encode.
6 -> 0110
8 -> 1000
So 68 (decimal) is equivalent to 0110 1000 (BCD).
A drawback of this approach is the range of numbers it can represent. An 8-bit BCD
number can cover only 0 to 99 (decimal), whereas an 8-bit positive binary number can cover
0 to 255 (decimal). Resource utilization is low.

Bit size Range Range (Binary) Number of Unique Values


4-bit 0 to 9 0000 to 1001 10
8-bit 0 to 99 0000 0000 to 1001 1001 100
16-bit 0 to 9999 0000 0000 0000 0000 to 1001 1001 1001 1001 10000
32-bit 0 to 99999999 Too long to show 100000000
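A quick Python sketch of BCD encoding, packing each decimal digit into 4 bits
(illustrative only):

    # Encode a non-negative decimal number into BCD: 4 bits per decimal digit.

    def to_bcd(n):
        return " ".join(format(int(d), "04b") for d in str(n))

    print(to_bcd(68))    # 0110 1000
    print(to_bcd(1234))  # 0001 0010 0011 0100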

Addition and subtraction operations pose little problem for circuit implementation. Each
group of 4 digits is handled together in one operation, so the addition of two 8-bit BCD numbers
requires two addition operations. BCD addition and subtraction is therefore less efficient
than in the positive binary representation, and the circuitry is a bit more complex. An
advantage, however, is the ease of conversion to printing and LCD display formats.

Sign-Magnitude Binary Representation


Both the positive binary representation and BCD cover only a positive range of values. There
are several variants that can cover negative values as well.
Sign-magnitude representation assigns the most significant bit as the indicator of negativity:
a value of 1 indicates a negative value and 0 indicates a positive one. In an 8-bit sign-magnitude
binary number, one bit is used for the sign and the remaining 7 bits for the magnitude.

Bit size Minimum (Binary) Maximum(Binary) Range (Decimal) Number of Unique Values
8-bit 1111 1111 0111 1111 -127 to +127 255
16-bit 1111 1111 1111 1111 0111 1111 1111 1111 -32767 to +32767 65535

Resource utilization is good, except that there are now two patterns representing the value
zero: 0000 0000 and 1000 0000. So an 8-bit sign-magnitude binary representation
can represent only 255 values (from -127 to +127).
Addition and subtraction of sign-magnitude binary numbers is more challenging. The
operations cannot be decomposed into smaller operations on individual digits.

One's Complement Binary Representation
The method of complements is sometimes used in subtraction. This method turns a subtraction
into a complement operation and an addition. For example, the expression 654 -
234 is converted into a complement operation on 234 and an addition:
654 - 234 (234 is converted into 766 by subtracting it from 1000)
654 + 766 = 1420 >>> 420 (the carry 1 is discarded)
In one's complement binary representation, we represent a negative number by finding the
one's complement of the corresponding positive number. The one's complement operation is
carried out by inverting every bit (turning 0 into 1 and 1 into 0).
Given the positive 8-bit binary number 0011 1000 (decimal 56), to find its corresponding
negative number (decimal -56) we apply the one's complement operation to the 8-bit binary
number:
0011 1000 >>> 1100 0111
Each time the one's complement operation is applied, the sign of the number is reversed. It
is therefore equivalent to a negation operation.
Note that the number 1100 0111 is in 1's complement format; since it is a negative number,
it cannot be converted directly to decimal. One's complement binary representation is
the system that uses the one's complement operation to work out the negative counterpart
of a positive binary number.
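A small Python sketch of the one's complement operation on 8-bit patterns (the XOR mask
is an implementation trick, not part of the representation itself):

    # One's complement of an 8-bit pattern: invert every bit.

    def ones_complement(pattern):
        bits = int(pattern, 2)
        return format(bits ^ 0b11111111, "08b")   # XOR with an all-ones mask

    print(ones_complement("00111000"))  # 11000111  (decimal 56 -> -56)
    print(ones_complement("11000111"))  # 00111000  (applying it again negates)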

The following table shows the range of numbers that can be represented by 8-bit and 16-bit
one's complement binary representation.

Bit size Minimum (Binary) Maximum(Binary) Range (Decimal) Number of Unique Values
8-bit 1000 0000 0111 1111 -127 to +127 255
16-bit 1000 0000 0000 0000 0111 1111 1111 1111 -32767 to +32767 65535

A key advantage of using a complement-based representation is the simplification of circuitry.
Only the addition circuitry is needed; subtraction is carried out by a simple
complement operation followed by addition.
0100 0000 - 0000 0001
0100 0000 + (-0000 0001)
<take the one's complement of 0000 0001, giving 1111 1110>
0100 0000 + 1111 1110 = 1 0011 1110
<the carry 1 is added back in: the end-around carry>
0011 1110 + 1 = 0011 1111

Exercise: 1's complement representation


Question: Convert the following 8-bit 1's complement binary numbers into decimal.
(i) 0001 1111 (ii) 1000 0001
Answer:
(i) 0001 1111 is positive. We can treat it as a positive binary number and convert it
directly to decimal. 0001 1111 = 31 (decimal)
(ii) 1000 0001 is negative. We cannot convert it directly to decimal; we must obtain its
positive equivalent with the 1's complement operation.
1000 0001 >>> 0111 1110 (positive binary) <1's complement operation>
0111 1110 >>> +126 (decimal) <numeral conversion>
+126 >>> -126 (decimal) <negation>

Two's Complement Binary Representation
The drawback of the one's complement binary representation is a little wastage in the range of
values it can represent. For an 8-bit binary number there are 256 different patterns, from
0000 0000 to 1111 1111, so 8 bits can represent 256 values if there is no wastage. The range
of an 8-bit one's complement representation is -127 to +127, which is only 255 different
values.
The two's complement binary representation is a small modification of its one's
complement counterpart. The only difference is that the two's complement operation is used
instead of the one's complement operation.
The two's complement operation has two steps:
• One's complement operation (bit inversion)
• Add one
Given the positive 8-bit binary number 0011 1000 (decimal 56), to find its corresponding
negative number (decimal -56) we apply the two's complement operation to the 8-bit binary
number:
0011 1000 >>> 1100 0111 + 1 = 1100 1000

As with the one's complement representation, the two's complement operation is equivalent
to a negation operation. A negative 2's complement number cannot be converted directly to a
decimal number: its positive absolute value should first be determined by the 2's complement
operation, and then the numeral system conversion can take place.
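A minimal Python sketch of the two's complement operation, together with the decimal
interpretation of an 8-bit pattern:

    # Two's complement of an 8-bit pattern: invert all bits, then add one.

    def twos_complement(pattern):
        bits = int(pattern, 2)
        return format(((bits ^ 0xFF) + 1) & 0xFF, "08b")

    def to_signed_decimal(pattern):
        """Interpret an 8-bit pattern as a 2's complement number."""
        value = int(pattern, 2)
        return value - 256 if value >= 128 else value  # sign bit set => negative

    print(twos_complement("00111000"))    # 11001000  (56 -> -56)
    print(to_signed_decimal("11001000"))  # -56
    print(to_signed_decimal("10000000"))  # -128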

The range of an 8-bit two's complement representation is -128 to +127, and there are 256
unique values.
Bit size Minimum (Binary) Maximum(Binary) Range (Decimal) Number of Unique Values
8-bit 1000 0000 0111 1111 -128 to +127 256
16-bit 1000 0000 0000 0000 0111 1111 1111 1111 -32768 to +32767 65536

Exercise: 2's complement representation


Question: Show that the 8-bit 2's complement number 1000 0000 is equivalent to
decimal -128.
Answer: The number 1000 0000 is negative because the sign bit is 1. We must
first find its positive equivalent by applying the 2's complement operation, and then perform
the numeral system conversion.
1000 0000 >>> 0111 1111 + 1 = 1000 0000 (positive binary) <2's complement operation>
1000 0000 (positive binary) >>> 128 (decimal) <numeral system conversion>
128 (decimal) >>> -128 (decimal) <negation>

Exercise: binary representation


Question: Given the 8-bit binary pattern 0100 0010, work out its decimal value under each
of the following representations: (i) positive binary, (ii) 1's complement,
(iii) 2's complement.
Answer:
(i) Direct conversion. 0100 0010 >>> 64 + 2 = 66 (decimal)
(ii) In 1's complement representation it is a positive number. It can be treated as a
positive binary number as above. 0100 0010 >>> 64 + 2 = 66 (decimal)
(iii) In 2's complement representation it is also a positive number. It can be treated as a
positive binary number as above. 0100 0010 >>> 64 + 2 = 66 (decimal)
The values of a pattern are the same under the different representations when the pattern is a positive value.

Exercise: binary representation


Question: Given the 8-bit binary pattern 1110 1011, work out its decimal value under each
of the following representations: (i) positive binary, (ii) 1's complement,
(iii) 2's complement.
Answer:
(i) Direct conversion. 1110 1011 >>> 128 + 64 + 32 + 8 + 2 + 1 = 235 (decimal)
(ii) In 1's complement representation it is a negative number. First apply the 1's complement
operation to obtain its positive equivalent: 1110 1011 >>> 0001 0100. Then convert it:
0001 0100 >>> 16 + 4 = +20 (decimal). Finally perform negation: +20 >>> -20 (decimal)
(iii) In 2's complement representation it is a negative number. First apply the 2's complement
operation to obtain its positive equivalent: 1110 1011 >>> 0001 0101. Then convert it:
0001 0101 >>> 16 + 4 + 1 = +21 (decimal). Finally perform negation:
+21 >>> -21 (decimal)
Given the same pattern, the number means different values under different representation
methods.

The ALU will support at least three arithmetic operations: negation, addition, and subtraction.
Negation is easy with the 2's complement representation because it is exactly the 2's
complement operation.
Addition and subtraction have almost the same implementation in the 2's complement
representation. A subtraction can be changed into an addition of a negated operand: for
example, A - B is equivalent to A + (-B). The negation of B is done with the 2's complement
operation, and addition is then applied to the two operands.
Normally the number of bits used to represent an integer is fixed for a particular ALU.
This fixed number of bits restricts the range of values that can be represented. It can also cause
problems for arithmetic operations.

Example: addition of 2's complement numbers


Question: Evaluate the following expressions of 2's complement numbers
(i) 0001 1110 + 0010 1111
(ii) 0100 1110 + 1110 1111
(iii) 0100 1110 – 1110 1111
(iv) 0111 1111 + 0000 1000
Answer:
(i) Both operands are positive and they can be considered as positive binary numbers. The
addition follows the old method. 0001 1110 + 0010 1111 = 0100 1101
(ii) The second operand is a negative number. However, the addition can proceed as usual.
0100 1110 + 1110 1111 = 1 0011 1101. The carry 1 is discarded.
Recheck 78 + (-17) = 61.
(iii) The operator is subtraction. Convert it back to addition. 0100 1110 – 1110 1111 >>>
0100 1110 + (-1110 1111). Apply 2's complement to the second operand for the negation.
0100 1110 + (-1110 1111) >>> 0100 1110 + 0001 0001 = 0101 1111.
Recheck 78 – (-17) = 95.
(iv) This seems a straightforward calculation. Both operands are positive, so we use the old
method: 0111 1111 + 0000 1000 = 1000 0111. A recheck reveals something wrong:
127 + 8 should equal 135, but the answer 1000 0111 is -121 (decimal). A
positive number plus another positive number should not give a negative number. There is
an error due to the range limitation: the largest 8-bit 2's complement number is +127, so
the correct answer +135 cannot be represented. We call this an overflow error.

Overflow errors can easily be detected in two's complement operations. The following table
summarizes the possibility of overflow in addition and subtraction operations (a subtraction
is first converted into an addition).

Operation Overflow Detection


Addition of two +ve numbers Possible Overflow occurs if the result is a negative number
Addition of two -ve numbers Possible Overflow occurs if the result is a positive number
Addition of a +ve and a -ve number Impossible N/A

Example: addition of 2's complement numbers
Question: Consider the examples of 127 + 1 and -128 - 1 in 8-bit 2's complement binary
representation.
Answer:
127 + 1
0111 1111 + 0000 0001 = 1000 0000 <= -128 in decimal
Clearly something is wrong, because the addition of two positive numbers cannot result in a
negative number.
-128 - 1 => (-128) + (-1)
1000 0000 + 1111 1111 = 1 0111 1111 <= ignoring the carry bit, this is 127 in decimal
Again something is wrong, because subtracting a positive number from a negative
number should not result in a positive number.

Overflow is a signal or condition indicating that the result has gone out of range. It is an
exception status that must be detected.
• Overflow is set when the result of an addition or subtraction overflows into the sign bit.
• Addition of two positive integers cannot result in a negative integer; if it does, an
overflow has occurred.
• Addition of two negative integers cannot result in a positive integer; if it does, an
overflow has occurred.
Carry is another important status. Computer systems have a limit on the number
of bits that can be handled in one operation. For example, a 64-bit value may be used in a computer
system that can handle only 32-bit operations. Typically the 64-bit value is separated
into two 32-bit parts, each suitable for one operation. The carry status is set when
the result of an addition or subtraction exceeds the fixed number of bits allocated. A sketch
of both status flags follows.
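The following Python sketch models an 8-bit two's complement adder that reports both
status flags. The overflow rule (operands of the same sign producing a result of the opposite
sign) and the carry rule (a carry out of bit 7) follow the text; the function itself is only
illustrative.

    # 8-bit two's complement addition with Carry and Overflow status flags.

    def add8(a, b):
        """a, b are 8-bit patterns (ints 0..255); returns (result, carry, overflow)."""
        total = a + b
        result = total & 0xFF               # keep only 8 bits
        carry = total > 0xFF                # carry out of the most significant bit
        # Overflow: operands share a sign but the result's sign differs.
        sign_a, sign_b, sign_r = a >> 7, b >> 7, result >> 7
        overflow = (sign_a == sign_b) and (sign_a != sign_r)
        return result, carry, overflow

    print(add8(0b01111111, 0b00000001))  # 127 + 1:     overflow, no carry
    print(add8(0b10000000, 0b11111111))  # -128 + (-1): carry and overflow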

7. Summary

We now have more details on the status outputs of the ALU. The following figure shows the
revised ALU design. This ALU can carry out integer addition, subtraction, and negation in
2's complement binary representation.

Appendix 2A. Base Conversion

Conversion between numbers of different bases (numeral systems) can be done easily with many
modern calculators.
However, if you are required to do it by hand, there are two approaches to choose from,
depending on the source base and the destination base.
• If the source and destination bases are powers of 2, such as binary, octal, or
hexadecimal, conversion is easy if it goes through binary. For example, to convert
base-8 to base-16, first convert base-8 to base-2, and then base-2 to base-16.
• If the bases are other numbers, using the decimal numeral system as the point
of interchange is efficient. For example, to convert from base-3 to base-16, first
convert from base-3 to base-10, and then from base-10 to base-16. Using
base-10 as the point of interchange lets us do most of the arithmetic in decimal,
which is what we are familiar with.

The following describes how to convert base-N to base-10, and then base-10 to base-N.
We assume that positional notation is adopted in all bases and that negative
values are represented with a negative sign.

A1. Conversion from base-N to base-10
Numbers in a base-N numeral system can be converted to decimal easily if they adopt the
positional notation representation.
In positional notation, the value of a symbol depends on both the symbol itself
and the position in which the symbol appears in the number. The right-most position is position
0, and the position index increases by one with each move to the left. The value
contributed by a digit equals the product of the intrinsic value of the symbol and the base
raised to the power of the position index.
Value of a digit = intrinsic value of symbol * base^(position index)
Take the decimal numbers 10020 and 3100 as examples. The symbol '1' is at position index 4
in the number 10020, so that digit is worth 1 (intrinsic value) * 10^4 = 10000 (decimal).
The symbol '1' is at position index 2 in the number 3100, so that digit is worth 1 (intrinsic
value) * 10^2 = 100 (decimal).

Example: base conversion


Question: Convert 101010 (plain binary) to decimal
Answer:

Digits           1     0     1     0     1     0
Representation   2^5   2^4   2^3   2^2   2^1   2^0
Value of digit   32    0     8     0     2     0

Value = 32 + 8 + 2 = 42 (decimal)

Question: Convert 123 (octal) to decimal.
Answer:

Digits           1        2       3
Representation   8^2      8^1     8^0
Value of digit   1 * 64   2 * 8   3 * 1

Total = 64 + 16 + 3 = 83 (decimal)

Values in any number system can be converted to decimal using the same method.

A2. Conversion from base-10 to base-N
Values in the decimal numeral system can be converted into any other numeral system by the
method of division.
If a decimal number D is to be converted to base B, D is repeatedly
divided by B until the quotient is zero. The remainders become the digits of the value in
base B, from least significant to most significant.

Example: base conversion


Question: Convert 42 (decimal) into binary
Answer:

Division      Remainder   Remarks
42 / 2 = 21   0           least significant bit
21 / 2 = 10   1
10 / 2 = 5    0
5 / 2 = 2     1
2 / 2 = 1     0
1 / 2 = 0    1           most significant bit

The binary number is read from the bottom to the top: 101010
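A short Python sketch of this repeated-division method, collecting remainders until the
quotient reaches zero:

    # Convert a non-negative decimal integer to its digits in base B (2..16)
    # by repeated division, collecting the remainders.

    def from_decimal(n, base):
        if n == 0:
            return "0"
        digits = []
        while n > 0:
            n, remainder = divmod(n, base)
            digits.append("0123456789ABCDEF"[remainder])
        return "".join(reversed(digits))    # last remainder is most significant

    print(from_decimal(42, 2))     # 101010
    print(from_decimal(1000, 16))  # 3E8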

A3. Conversion between base-2, base-8, and base-16


The octal (base-8) and hexadecimal (base-16) numeral systems are convenient shorthand
forms of the binary numeral system. Conversion between them is simple.

Conversion between base-2 and base-16


The digits of a binary number are divided into groups of 4 digits, starting from the least
significant digit. Each group of 4 binary digits is converted into one hexadecimal digit. A
lookup table like the following can be used.
Binary Hex Binary Hex Binary Hex Binary Hex
0000 0 0100 4 1000 8 1100 C
0001 1 0101 5 1001 9 1101 D
0010 2 0110 6 1010 A 1110 E
0011 3 0111 7 1011 B 1111 F

Example: base conversion


Question: Convert 000010101111110010 into the equivalent hexadecimal.
Answer:
000010101111110010 <divide into groups of four digits, starting from the right>
00 0010 1011 1111 0010 <convert each group into the corresponding hex digit>
0 2 B F 2 (Hex)

To convert from hexadecimal to binary, the above process is simply reversed. The same table
is used to convert each hexadecimal digit into 4 binary digits.
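A sketch of the grouping method in Python, padding the bit string on the left to a multiple
of four and looking up each group (the built-in int/format functions would also do this
directly):

    # Convert a binary string to hexadecimal by grouping bits in fours,
    # starting from the least significant digit.

    def bin_to_hex(bits):
        bits = bits.zfill((len(bits) + 3) // 4 * 4)   # pad on the left
        groups = [bits[i:i+4] for i in range(0, len(bits), 4)]
        return "".join(format(int(g, 2), "X") for g in groups)

    print(bin_to_hex("000010101111110010"))  # 02BF2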

Conversion between base-2 and base-8
The process is basically the same, except that the binary number is divided into groups of
three digits.

Example: base conversion


Question: Convert 000010101111110010 into the equivalent octal.
Answer:
000010101111110010 <divide into groups of three digits, starting from the right>
000 010 101 111 110 010 <convert each group into an octal digit>
0 2 5 7 6 2 (Octal)

Powers of 2
Some commonly used powers of 2 should be remembered.

Power   Binary bits                   Exponent Notation   Decimal
0       1                             2^0                 1
1       10                            2^1                 2
2       100                           2^2                 4
3       1000                          2^3                 8
4       10000                         2^4                 16
5       100000                        2^5                 32
6       1000000                       2^6                 64
7       10000000                      2^7                 128
8       1 0000 0000                   2^8                 256
12      1 0000 0000 0000              2^12                4096
16      1 0000 0000 0000 0000         2^16                65536
20      1 0000 0000 0000 0000 0000    2^20                1048576 (1M)
24      -                             2^24                16777216 (16M)
28      -                             2^28                268435456 (256M)
32      -                             2^32                4294967296 (4G)

Appendix 2B. Radix Three

Is there a best numeral system?

The decimal system is favoured by the current culture, and the binary system is used by
computer designers and engineers. Each has its merits.
In the following we will show that a base-3 numeral system is the most efficient mathematically.
Base 10 is culturally preferred and base 2 is preferred in engineering, but the base-3 number
system strikes the right balance between the number of symbols and the width of representation.
A numeral system is based on a particular number of symbols, and the number of symbols
affects the range of values it can represent. For a particular value, one numeral system may
need a certain number of digits while another numeral system may need more.
Given the decimal number 1000 (decimal), the following table shows its value equivalents
in other numeral systems.

Base No of Symbols Equivalent Number Width (No of Digits)


2 2 1111101000 10
3 3 1101001 7
8 8 1750 4
10 10 1000 4
16 16 3E8 3

Consider another example in which we want to represent the numbers from 0 (decimal) to
99 (decimal).
In the binary system we can represent the range from 000 0000 to 110 0011. There are 2 types
of symbols in the binary system: 0 and 1. The width required to represent the range is 7 (there are
7 symbols arranged in positional notation). The number of symbols is small, but the
width is large.
In the decimal system we can represent the range from 00 to 99. There are 10 types of symbols
in the decimal system: 0, 1, 2, ..., 9. The width required to represent the range is 2 (there are 2
symbols arranged in positional notation). The number of symbols is larger, but the
width is much smaller.
In the more extreme case of a base-100 system, there are 100 types of symbols, each
representing a value from 0 to 99. The width required to represent the range 0 to 99 is 1 (only 1
symbol is required). The number of symbols is very large, but the width is the smallest.

Numeral System Number of Symbols (Radix) Width Required to Represent 0 to 99


Binary 2 7
Decimal 10 2
Centesimal 100 1

Both the number of symbols and the width of representation have implications.
• The number of symbols affects the coding of digital signal lines. The more symbols
there are, the more error prone the data transmission, and the more of an engineering
challenge it is to design a reliable many-symbol signal line. (From a more human
perspective, learning 2 symbols is easy, learning 10 symbols (0 to 9) is more
difficult but manageable for kids, and learning 100 symbols is genuinely challenging.)
• The width of representation affects the number of signal lines required. Each digit
requires one signal line, and every additional signal line adds complexity and cost to
the computer system.
We want to minimize both the number of symbols and the width of representation, while keeping
the range of representation the same. In the above example, we kept the range 0 to 99
fixed so that we could analyse the relation between the number of symbols and the width.
In mathematical terms, we want to minimize the product of the number of symbols n and the
width w, while holding the range constant. The range is n^w.
The optimal number of symbols (the radix) turns out to be the constant e (approximately 2.718).
This calculation assumes that the variables n and w are continuous rather than discrete. The
following graph shows that the minimum occurs at 2.718 for different values of n^w (n^w = 10,
n^w = 100, and n^w = 1000). A short derivation is sketched below.
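For readers who want the missing step, here is a compact derivation (in LaTeX notation) of
why the optimum radix is e, under the continuity assumption stated above:

    % Hold the range R = n^w constant, so w = \ln R / \ln n.
    % Minimize the cost C(n) = n * w:
    C(n) = n \cdot \frac{\ln R}{\ln n}, \qquad
    \frac{dC}{dn} = \ln R \cdot \frac{\ln n - 1}{(\ln n)^2} = 0
    \;\Longrightarrow\; \ln n = 1 \;\Longrightarrow\; n = e \approx 2.718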

For practical reasons we need an integer as the radix. Because the integer 3 is closer to
2.718 than any other integer, base 3 (ternary) is mathematically the most efficient representation.
We come back to the question of whether a best number system for computer systems exists.
The answer is yes, but it depends on your criteria for what constitutes the best system,
and different people may come up with different answers. Clearly, almost all modern computer
systems are based on binary. There are two possible reasons for this phenomenon:
• The first computers were based on the binary system, so it was cost effective for
subsequent designs to follow a well-proven model. Binary computer
systems therefore captured a significant "market share".
• Technologically, the binary system is easier to implement because the engineering
techniques for binary operations were well established.


Chapter 3. Arithmetic and Logic Unit Design: Floating Point Operations

In this chapter we will refine the design of the Arithmetic and Logic Unit to handle fractional
values. Integers are a very special type of number; real-world data is continuous.
• If precision is not an important requirement, then we can use integers to approximate
such real world data.
• If higher precision is required, then floating point representation should be used.
The floating-point representation allows numbers to be represented with a particular
precision requirement. Normally the precision level of a number is equivalent to its number
of significant digits.
• Anders has assets of 1 million dollars. We are not sure whether he has exactly
$1,000,000 or $1,999,999. The number of significant digits is only one; it is not
precise.
• Betsy has assets valued at $12,540 thousand. We are not sure whether she has $12,540,000 or
$12,540,999. However, the number of significant digits is 5; it is a more precise
description of her assets.
The floating-point representation can represent a very large number or a very
small number with any number of significant digits. It is based on the exponent
representation, so that the radix point can float (move forward or backward) while the
exponent is adjusted at the same time.

1. Exponential and Floating Point Representations
The floating-point representation is based on the exponent representation. In decimal, the
exponent representation has the following form:

sign significant-digits x 10^exponent

Number in Exp Representation   Value in decimal (precision is not indicated)   Sign   Exponent   # Significant Digits
1.23 x 10^4                    12300.0                                         +      4          3
1.62001 x 10^24                1620010000000000000000000.0                     +      24         6
5.2 x 10^-18                   0.0000000000000000052                           +      -18        2
-2.3456 x 10^0                 -2.3456                                         -      0          5

One important feature of the exponent representation is its economical format. The big
number 1620010000000000000000000.0 can be succinctly represented as 1.62001 x 10^24.
Real-world data is sometimes of a very small or very large magnitude. For example, the
number 0.0000 0000 0000 0000 0001 requires 20 digits to write out. This value is actually
not very small if we compare it to physical quantities such as the mass of an electron
or the diameter of an atom.
The sequence of zeros is useful only for indicating the position of the final digit 1. It is
more concise to say that the number has a 1 after 19 zeros to the right of the radix point. The
same goes for very large numbers such as 1234000000000000.0.
If we are choosing a representation for our ALU and programmable computer, a simpler
and more economical format is an advantage.
We can convert any decimal number into exponent format by the process of normalization.
This is the process of moving the radix point of the number until the following normalized form
is achieved:
M.MMMMM... x 10^exponent
The number of M-type digits depends on the number of significant digits. The M-type digits
are called the magnitude part or the mantissa. The radix point sits between the
first and the second digit.
• Moving the radix point one digit to the left is equivalent to division by 10, so we
compensate by adding 1 to the exponent.
• Moving the radix point one digit to the right is equivalent to multiplication by 10, so
we compensate by subtracting 1 from the exponent.

Exercises: Exponent Format
Question: Convert the following decimal numbers into the normalized exponent format.
For each, indicate the sign, exponent, and mantissa parts. (i) 123000000;
(ii) 123456789; (iii) -3450000; (iv) 0.00001234; (v) -0.00000823

Decimal Number   Exponential Format    Sign   Exponent   Mantissa
123000000        1.23 x 10^8           +      8          1.23
123456789        1.23456789 x 10^8     +      8          1.23456789
-3450000         -3.45 x 10^6          -      6          3.45
0.00001234       1.234 x 10^-5         +      -5         1.234
-0.00000823      -8.23 x 10^-6         -      -6         8.23

When we design a representation of floating point values for the ALU, we should start
from the resource point of view. Suppose we decide that we can afford 8
digits to represent a floating point value. The design issue is then how to assign different roles
to the 8 digits.
Recall that the exponent format has three parts:
• The sign (indicates positive or negative)
• The exponent and the base (indicate the position of the significant digits and the base
of the numeral system)
• The mantissa (the significand, or the significant digits)
If the base is fixed at 10 (decimal), the question is which digits to assign to the
sign, exponent, and mantissa respectively.
Consider the following format for the 8-digit number:
SEEMMMMM
• The symbol S represents the sign. The digit 0 indicates positive and 1 negative.
• The symbols E are the exponent digits. The possible range of EE is
from 00 to 99.
• The symbols M are the mantissa digits.
The above number represents the following number in exponential format:
S M.MMMM x 10^EE

Example: Exponential format


Question: Given the format SEEMMMMM as described above, work out the values of the
following numbers in the format.
(i) 02010020 (ii) 10199999
Answer:
(i) S = 0; EE = 20; M = 10020. The number in exponential format is + 1.0020 x 10^20
(ii) S = 1; EE = 01; M = 99999. The number in exponential format is - 9.9999 x 10^1
This format is not able to represent numbers with negative exponents, such as 1.0000 x 10^-2 or
0.01. This is a serious shortcoming.

The range of the exponent can be refined with a method called excess-N. If EE is in
excess-50 format, the stored value is in excess of the actual exponent by 50; the actual
exponent is the stored value minus 50. For example, if EE is 01, the actual exponent is (01 - 50) = -49.
The excess-50 notation changes the range of the exponent from 0..99 to -50..+49. The range is
now extended to negative exponents.
Example: Exponential format with Excess-N exponent
Question: Given the format SEEMMMMM and now the EE in excess-50 format, work out
the value of the following numbers of the format.
(i) 02010020 (ii) 10199999
Answer:
(i) S = 0; E = 20; M = 10020; The number in exponential format is + 1.0020 x 10^(20-50). It is equal to + 1.0020 x 10^-30 (the leading digit 1 falls in the 30th decimal place).
(ii) S = 1; E = 01; M = 99999; The number in exponential format is - 9.9999 x 10^(01-50). It is equal to - 9.9999 x 10^-49.
The excess-N notation is useful because there is no symbol for representing negative sign in
our format. So we use the concept of excess to extend the range to negative. The following
figure shows the different range with and without the excess-50 notation in our
representation.
Note that there is a region around zero that is outside the range of this representation method. These are numbers of very small magnitude. When the representation method fails to represent a value in this range, the condition is known as underflow.
Arithmetic operations in exponential representation may be carried out using the following procedures:
• In addition and subtraction, the exponent and mantissa are handled separately. The exponents must first be aligned; any mantissa overflow is then fixed by adjusting the exponent.
0.12 x 10^-2 + 0.345 x 10^-4
= 0.12 x 10^-2 + 0.00345 x 10^-2
= 0.12345 x 10^-2
• In multiplication and division, the mantissas are multiplied or divided normally, while the exponents are added or subtracted. If the exponents are stored in excess-50, the sum of two stored exponents carries an extra 50, which must be subtracted to bring the result back into excess-50.
0.2 x 10^-2 x 0.4 x 10^3
= 0.08 x 10^1
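As an illustration, the following C sketch carries out the alignment step of the addition example above. The ExpNum structure is our own invention for demonstration purposes, not part of any real ALU:

#include <stdio.h>

/* A number in exponent format: mantissa x 10^exponent. */
struct ExpNum { double mantissa; int exponent; };

/* Align the smaller exponent to the larger one, then add the mantissas. */
struct ExpNum add_expnum(struct ExpNum a, struct ExpNum b) {
    while (a.exponent < b.exponent) { a.mantissa /= 10.0; a.exponent++; }
    while (b.exponent < a.exponent) { b.mantissa /= 10.0; b.exponent++; }
    struct ExpNum r = { a.mantissa + b.mantissa, a.exponent };
    return r;
}

int main(void) {
    struct ExpNum a = { 0.12, -2 }, b = { 0.345, -4 };
    struct ExpNum r = add_expnum(a, b);
    printf("%g x 10^%d\n", r.mantissa, r.exponent);  /* 0.12345 x 10^-2 */
    return 0;
}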
2. IEEE 754 Representation
The IEEE 754 Floating Point Standard is a standard for representing floating point values in
modern computers. The full name of the standard is the IEEE Standard for Binary
Floating-Point Arithmetic (ANSI/IEEE Std 754-1985). The original standard defines two
formats of different precision.
• The single-precision 32-bit binary format uses 32 bits to represent a floating-point
number using the exponential representation and excess-N notation.
• The double-precision 64-bit binary format uses 64 bits to represent a floating-point
number using the exponential representation and excess-N notation.
There is also a 128-bit binary format and two decimal formats introduced in 2008, which will
not be discussed here.
The single-precision 32-bit format divides up the 32 bits into the following.
• Sign-bit (1-bit).
• Exponent bits (8-bits) in excess-127 notation and positive binary format.
• Mantissa bits (23-bits).
The above figure shows the roles of each bit in the 32-bit binary number.
The format of the magnitude part has two modes.
• In the normalized mode, the magnitude part always starts with 1.
• The denormalized mode is specially designed for representing numbers of very small magnitude; the magnitude part starts with 0 and the exponent is always -126.
The exponent is in excess-127 notation. The range of allowable exponents is from -126 to
127 only. Some exponent values are reserved.
Exponent Part   Mantissa Part   Value
0               0               Zero
0               Non-zero        Denormalized numbers
1 to 254        Any             Normalized numbers
255             0               Infinity
255             Non-zero        NaN
The value NaN means Not A Number. This is a special flag often produced as the result of an invalid operation, for example 0/0 or the square root of a negative value.
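The three fields can be inspected directly in C. The following sketch, with variable names of our own, uses memcpy to reinterpret the bits of a float as an integer and then masks out the sign, exponent, and mantissa fields:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float f = -1.75f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret the 32 bits */

    uint32_t sign     = bits >> 31;           /* bit 31 */
    uint32_t exponent = (bits >> 23) & 0xFF;  /* bits 30..23, excess-127 */
    uint32_t mantissa = bits & 0x7FFFFF;      /* bits 22..0 */

    printf("S=%u E=%u (actual %d) M=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    /* prints: S=1 E=127 (actual 0) M=0x600000 */
    return 0;
}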
Example: Smallest Magnitude Value in Normalized and Denormalized Mode
Question: What is the smallest magnitude value in normalized and denormalized mode
respectively?
Answer:
The smallest normalized mode IEEE 754 number is the following.
0 0000 0001 0000 0000 0000 0000 0000 000
The three parts are: S = 0, E = 0000 0001, M = 0000 0000 0000 0000 0000 000
This IEEE 754 number in exponential format is the following.
>>> + 1.0000 0000 0000 0000 0000 000 x 2^(0000 0001 - 0111 1111)
>>> + 1.0000 0000 0000 0000 0000 000 x 2^(-0111 1110)
Convert to decimal
>>> + 1.0 x 2^-126 (approximately 1.18 x 10^-38)

The smallest denormalized mode IEEE 754 number is the following.
0 0000 0000 0000 0000 0000 0000 0000 001
The three parts are: S = 0, E = 0000 0000, M = 0000 0000 0000 0000 0000 001
This IEEE 754 number in exponential format is the following.
>>> + 0.0000 0000 0000 0000 0000 001 x 2^(-0111 1110)
>>> + 1.0000 0000 0000 0000 0000 000 x 2^(-1001 0101) <moving the radix point to the right>
Convert to decimal
>>> + 1.0 x 2^-149 (approximately 1.4 x 10^-45)
With denormalized mode, the IEEE 754 standard can represent smaller magnitude numbers.
Conversion from decimal into IEEE 754 Numbers
The following steps allow the conversion from decimal into IEEE 754 numbers.
• Find out if it is a special value and convert it as such. The special values include zero, infinity, and very small numbers in the denormalized range.
• Convert the decimal number to a binary number.
• The two sides of the radix point (the integral part and the fractional part) should be converted separately.
• Convert the binary number into the normalized form by adjusting the exponent and mantissa accordingly.
• Assemble the digits into the IEEE 754 format.
Example: Conversion from decimal to IEEE 754 format
Question: Convert the following decimal numbers into IEEE 754 numbers.
(i) 0; (ii) -1.75; (iii) 9876.25
Answers:
(i) The number 0 is a special value. This is a simple conversion done by looking up the table.
S = 0 E = 0000 0000 M = 0000 0000 0000 0000 0000 000
The number is 0 0000 0000 0000 0000 0000 0000 0000 000 (IEEE 754-32 bit)
(ii) We convert -1.75 into a binary number first.
The integral part = 1 (decimal) = 1 (binary)
The fraction part = 0.75 (decimal) = 0.5 + 0.25 = 2^-1 + 2^-2 = 0.11 (binary)
Binary format = 1.11 x 2^0
Normalization >>> 1.11 x 2^0 >>> 1.11 x 2^(0111 1111 - 0111 1111)
S = 1 E = 0111 1111 M = 1100 0000 0000 0000 0000 000 (note that the leading 1 is not part of the mantissa digits)
The number is 1 0111 1111 1100 0000 0000 0000 0000 000 (IEEE 754 32-bit)
(iii) We convert 9876.25 into a binary number first.
The integral part = 9876 (decimal) = 10 0110 1001 0100 (binary)
The fraction part = 0.25 (decimal) = 2^-2 = 0.01 (binary)
Binary format = 10 0110 1001 0100.01 x 2^0
Normalization >>> 1.0011 0100 1010 001 x 2^13 >>> 1.0011 0100 1010 001 x 2^1101
>>> 1.0011 0100 1010 001 x 2^(1000 1100 - 0111 1111)
S = 0 E = 1000 1100 M = 0011 0100 1010 0010 0000 000
The number is 0 1000 1100 0011 0100 1010 0010 0000 000 (IEEE 754 32-bit)
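The worked examples can be checked against a real compiler. The following C sketch prints the 32-bit patterns the compiler produces for the same decimal values; the expected outputs in the comments follow from the hand conversions above:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

static void show(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    printf("%10g -> 0x%08X\n", f, bits);
}

int main(void) {
    show(0.0f);      /* 0x00000000 */
    show(-1.75f);    /* 0xBFE00000 = 1 0111 1111 1100 0000 0000 0000 0000 000 */
    show(9876.25f);  /* 0x461A5100 = 0 1000 1100 0011 0100 1010 0010 0000 000 */
    return 0;
}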
Conversion from IEEE 754 Numbers to decimal
The following steps allow the conversion from IEEE 754 numbers to decimal.
• Extract the three parts: sign, exponent, and mantissa.
• Find out if it is a special value and convert it as such. Check the exponent and the
mantissa parts.
• Put the digits in the normalized form if it is in normalized mode.
• Remove the exponent by shifting the radix point.
• Convert the integral part and fraction part separately into decimal.
Example: Conversion from IEEE 754 format to decimal
Question: Convert the following IEEE 754 number into decimal
0 1000 0100 0101 0000 0000 0000 0000 000
Answer:
It is not a special value. So convert it as a normalized number.
S = 0; E = 1000 0100 M = 0101 0000 0000 0000 0000 000
Normalised form >>> 1.0101 x 2^(1000 0100 - 0111 1111) >>> 1.0101 x 2^101 (binary) >>> 1.0101 x 2^5
Remove the exponent format >>> 101010.0 x 2^0 >>> 101010.0
Convert into decimal >>> 32 + 8 + 2 = 42 (decimal)
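The decoding steps can also be expressed in C. The following sketch rebuilds the value of the example's bit pattern (0x42280000 in hexadecimal) by applying (-1)^S x 1.M x 2^(E-127) directly; it assumes a normalized number with a non-negative actual exponent, which holds for this example:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t bits = 0x42280000;   /* 0 1000 0100 0101 0000 0000 0000 0000 000 */

    int sign = (bits >> 31) ? -1 : 1;
    int exponent = (int)((bits >> 23) & 0xFF) - 127;        /* excess-127 */
    double mantissa = 1.0 + (bits & 0x7FFFFF) / 8388608.0;  /* 1.M; 8388608 = 2^23 */

    double value = sign * mantissa;
    for (int i = 0; i < exponent; i++) value *= 2.0;  /* apply 2^exponent */

    printf("%g\n", value);   /* prints 42 */
    return 0;
}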
The IEEE 754 64-bit standard offers greater range and precision. It has the following format.
• Sign-bit (1-bit).
• Exponent bits (11-bits) in excess-1023 notation and positive binary format.
• Mantissa bits (52-bits).
Reading: IEEE 754 Standard
http://en.wikipedia.org/wiki/IEEE_754
You can read more about the latest development and the details of the IEEE 754-2008
format.
3. Alternative Representation Methods
There are actually several alternative representation methods that we have decided against in our ALU design. Each of them has its limitations and you should be able to analyse and
discuss these limitations.
Fractional Values in Binary Representation
In our common decimal numeral system, fractional values are represented with the help of a radix point or decimal separator, more popularly known as the decimal point. The radix point divides a number into an integral part and a fractional part. The following figure shows two examples of fractional decimal numbers.
The radix point determines the significance of digits. The digit to the immediate left of the
radix point has positional index of 0. The following table shows why the number 234.56 has
the value 234.56.
Digits           2            3           4 .         5             6
Representation   10^2 = 100   10^1 = 10   10^0 = 1    10^-1 = 0.1   10^-2 = 0.01
Value of digit   200          30          4           0.5           0.06

Total: 200 + 30 + 4 + 0.5 + 0.06 = 234.56
The same representation can be applied to binary numbers. We can place a radix point to denote the position of the digit of positional index 0. The following table shows the value of the binary number 1101.01 (binary).
Digits           1          1          0          1 .        0            1
Representation   2^3 = 8    2^2 = 4    2^1 = 2    2^0 = 1    2^-1 = 0.5   2^-2 = 0.25
Value of digit   8          4          0          1          0            0.25

The positive binary number 1101.01 (binary) can be evaluated as the following.

Total: 1 x 2^3 + 1 x 2^2 + 0 x 2^1 + 1 x 2^0 + 0 x 2^-1 + 1 x 2^-2
     = 8 + 4 + 0 + 1 + 0 + 0.25 = 13.25 (decimal)
This method, however, cannot be incorporated into our ALU design. The radix point (or binary point) is another symbol, which brings the total number of symbols to three ('0', '1' and '.'). We have already adopted a 2-level digital representation, one level for 0 and another for 1. There is no room for another symbol.
Fixed Radix Point
One solution to the above problem is to fix the radix point at some position. For example, we
can fix the radix point between positional index 1 and 2 in an 8-bit binary number. Because
the radix point is built into the format, there is no need for another symbol.
So the 8-bit binary number 11010101 actually means 110101.01
The position of the radix point is implicit, between index 1 and 2. It decides how many digits belong to the integral part and how many belong to the fractional part.
If this solution is used, the designer must be careful in deciding the implicit position of the
radix point. Compare the following two designs: one fixes the radix point before index 5 and
one fixes it before index 1.
Radix point before index   Example                  Integral part   Fractional part   Drawback
1                          11010101 => 110101.01    6 digits        2 digits          Possible fractional values are 0.00, 0.25, 0.5 and 0.75 only
5                          11010101 => 11.010101    2 digits        6 digits          The maximum value is only 3.984375
The first option suffers from a precision problem and the second option suffers from a range
problem.
We can achieve better range and precision by using a longer representation (16-bit or 32-bit).
Still care must be taken to decide the distribution of digits between the integral part and
fractional part.
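The following C sketch illustrates the fixed radix point idea for the 8-bit format above, with the radix point fixed between index 1 and 2, so the stored integer is the real value scaled by 2^2 = 4:

#include <stdio.h>

int main(void) {
    unsigned char raw = 0xD5;    /* 11010101, interpreted as 110101.01 */
    double value = raw / 4.0;    /* = 53.25 */
    printf("%g\n", value);

    /* Encoding 3.5 in the same format: multiply by the scale factor. */
    unsigned char encoded = (unsigned char)(3.5 * 4);  /* 00001110 */
    printf("%u\n", encoded);     /* prints 14 */
    return 0;
}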
4. ALU Port Design for Fractional Value Operations
If we adopt the IEEE 754 single precision data representation format, then we can settle on
the input and output interface of the ALU. IEEE 754 single precision uses 32-bits to
represent a number and so the input and output must have 32 signal lines.
We will introduce a new terminology here: bus. A bus is a set of signal lines that connects two components in a computer. Each input is therefore a 32-bit bus. Similarly, the output data size should match the input data size, so the output is also a 32-bit bus.
If the ALU supports both integer and floating-point arithmetic, then the same input buses can
be used for both types of operations. In the case of integer operations, the input data is of 32-
bit 2's complement binary representation.
The following diagram shows our design of the ALU with details in the input and output
ports.
In real life, some ALUs support only integer operations or only floating point operations, but not both. Incorporating both types of operations in a single ALU increases the complexity of the internal design.
In some cases, a computer design includes two ALUs, one for integer operations and another for floating point operations. Such a computer will include an additional component to pass instructions and data to the relevant ALU.
5. Summary
We have discussed how to deal with fractional values in our ALU design. With 32-bit input and output ports, 2's complement binary representation, and IEEE 754 floating point representation, a large set of data can be handled by the ALU.
Chapter 4. Basic Programmable Computer Design: From ALU to CPU
This chapter will discuss the design of a basic programmable computer. We will use our 2's complement binary representation ALU as the core of our first programmable computer.
1. First Design of Programmable Computer
This chapter will first give a description of the design of our basic programmable computer.
Then we will review how the design of individual components was reached.
The following is a schematic diagram of the design of the programmable computer.
The programmable computer consists of the following components.
• Arithmetic and Logic Unit (ALU). This component will execute instructions and
operations.
• Registers. Registers are the yellow boxes in the above diagram. They are fast data
storage. Each register serves a different purpose: storage for intermediate results, storage
for instructions, bookkeeping for instruction execution, and interfacing with other
components.
• Memory. This component allows the storage and retrieval of data. The ALU obtains instructions and data from the memory. The memory interfaces with the ALU through the Memory Management Unit (MMU).
• Controller. This component coordinates and controls the operations of other components
including the ALU and the registers. The controller, control unit, or micro-controller
can be regarded as the commander of the computer components, directing everything to
work together.
• System bus. The system bus allows data exchange between the ALU and the registers.
• Clock. The clock provides a signal to allow various components to work in
synchronization.
Operation of the Programmable Computer
The programmable computer is re-purposed by giving it different instructions to execute. A key design feature of this computer is that the instructions and data are both stored in the memory. The instructions are the result of careful composition by programmers. These instructions are usually collectively called a program.
This design of a programmable computer is known as stored-program architecture or von
Neumann architecture. Von Neumann was known as the last of the great mathematicians
and he was involved in the first atomic bomb development and the ENIAC computer design.
The programmable computer operates in cycles that involve the following tasks.
• Read the next instruction from the Memory.
• Store the instruction in one of the Registers.
• The Controller examines the instruction and issues a series of commands to the
components. The commands are usually concerning asking the ALU to calculate or
moving data between the registers.
The tasks are repeated indefinitely. The last task is the key to a programmable computer: the computer performs different actions according to the instruction. The Controller is capable of recognizing different instructions and directing the components to perform the required tasks.
This design of a programmable computer includes many important features. The following
sections will discuss each of these features in detail.
2. The Clock and Synchronization of Operations
The ALU receives data from its two input ports, performs an arithmetic or logic operation, and delivers the result to its output port.
• For the ALU to operate correctly, the data must arrive at the two input ports in time for
the operation to happen.
• If the data has not arrived in time, then the operation result will be incorrect.
Our solution is based on synchronization of the operations by a clock. Synchronization is a general technique for making things happen at scheduled times.
• The scheduled times are determined by a clock signal and they are the moments when
operations will occur.
• The ALU would require that input data must arrive at the ports on or before these
moments of operations.
• The clock provides a signal for the ALU to synchronize its operations.
• The rising edges (or the falling edges) of the signal are used as the moments of
operations.
• Other components that provide data to the ALU also use the clock as a reference for their
operations.
3. Registers for Buffering the Input and Output Ports
The clock signal has helped to ensure that the input data are ready. There are still two more
challenges to the correctness of ALU operations.
• Other components must operate precisely so that the input data are actually ready.
• The output data may be lost if the receiving component is not ready to accept them.
Our solution is to introduce small pieces of memory called registers to buffer the input and output.
• A buffer is a piece of data storage used to store and retrieve data.
• Registers are a very fast type of buffer or memory.
• Input data that arrives earlier than the scheduled operation moment can be stored in the registers to wait for the operation moment to occur. This reduces the burden on the data-providing components to meet the timing precisely.
• Output data is stored in a buffer that keeps the data for a short while (between two operation moments), allowing the data-receiving component to read the data.
The following figure shows the ALU input and output ports connected to registers.
• The register connected to the output port is considered important because it contains the results of operations. It is named the accumulator (ACC).
• The other registers are usually labelled with an index or a number starting from 0. The
two registers at the input ports are called R0 and R1.
• The operation of the registers is also under the control of signals.
• Each of the registers can hold 32 bits of data, which is consistent with the bus width of
the ALU.
The Loop-Back Feature
Consider an arithmetic expression of 2 + 3 + 6 + 8. The expression must be executed one
operator after another, from left to right. The result of the first operator 2 + 3 will be needed
for the next operation + 6. Many arithmetic operations share this feature of using the result of
a previous operation as the input to the next operation.
The ALU and the registers are rearranged as shown in the following figure. The output of the
ACC is looped-back to one of the input ports of the ALU. The result of the previous
operation can be fed into the input again for the next operation.
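The loop-back behaviour can be mimicked in C. In the following sketch the variable acc plays the role of the ACC: the result of each addition is fed back as one input of the next operation:

#include <stdio.h>

int main(void) {
    int operands[] = { 3, 6, 8 };
    int acc = 2;                  /* the first operand is loaded into the ACC */
    for (int i = 0; i < 3; i++)
        acc = acc + operands[i];  /* the ACC output loops back as an input */
    printf("%d\n", acc);          /* 2 + 3 + 6 + 8 = 19 */
    return 0;
}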
4. Register for Storing Instructions
A programmable computer supports the execution of programs, which contains a number of
instructions carefully crafted together. A programmable computer executes a program by
executing the individual instructions of the program, one instruction after another.
The Controller is the component responsible for actually carrying out the tasks to complete an
instruction. These tasks can involve one of the following.
• Command the ALU to carry out different arithmetic and logic operations.
• Read data from the memory or write data to the memory.
• Move data between registers. For example, moving the data from the ACC to an input
port of the ALU.
The Instruction Register (IR) is a component for storing the current instruction. This register is no different from other registers, except that it is the place where the Controller looks for its current job. After the next instruction is loaded from the memory, it is stored in the Instruction Register.
The Form of Instructions
In our programmable computer, the instructions are in the form of binary numbers.
• The memory system has to store data in binary number format. The instructions are
stored also in the memory system. So the instruction form must be consistent with the
data form.
• A programmable computer designer needs to designate binary numbers to represent
different instructions. For example, suppose our computer supports five instructions: add, subtract, negate, load data, and store data. We need to decide which binary representation corresponds to the add instruction and which corresponds to the subtract instruction.
Assume that the designer has decided to use 32-bit binary numbers for representing
instructions. The following table shows a possible mapping for the five instructions.
Instruction   32-bit Binary Representation
Add 0000 0000 0000 0001 0000 0000 0000 0000
Subtract 0000 0000 0000 0010 0000 0000 0000 0000
Negate 0000 0000 0000 0011 0000 0000 0000 0000
Load 0000 0000 0000 0100 0000 0000 0000 0000
Store 0000 0000 0000 0101 0000 0000 0000 0000
The 32-bit binary representations above are known as operation code or opcode.
• The computer designer is free to designate different numbers for different instructions.
• The actual designation however has an impact on the efficiency of instruction execution.
• The instruction register should be large enough to hold every instruction.
• A more sophisticated programmable computer may support a larger number of instructions. For each instruction, computer designers would designate an opcode.
• The whole set of instructions supported by a computer is called an instruction set.
• The richness of the instruction set of a computer determines its programmability. With
more variety in the instruction set, we can build programs to perform a greater range of
tasks.
5. Memory System for Data and Instruction Storage
The Memory System is essential for storing data and instructions, including the results of program execution. The ACC can hold only 32 bits, which is not sufficient for the task.
The following are the features of the memory system that was discussed previously.
• Each unit of memory is 32-bit. This size should be consistent with the size of other
components.
• Each memory unit has a unique address, which is also a number.
• The interface to the memory system consists of a data port, an address port, and control
signals. The data port is for the transfer of data. The address port is for specifying the
address. An example of control signals is to control whether the memory operation is
read or write.
• The memory system operation is synchronized with a clock signal. For the memory
system to operate correctly, the timing of the data arriving at the ports must be precise.
The data port and the address port are connected to other components of the programmable computer.
To improve the operation resilience of the memory system, buffers, in the form of registers,
are added to the address port and the data port. They are called Memory Address Register
(MAR) and Memory Data Register (MDR).
• MAR will hold the address of the current memory operation.
• MDR will hold the data for the operation.
The problem now is to connect the Memory System to the ALU components. The following
figure shows how it is done.
A key feature of the design is a data bus that connects all the registers: IR, ACC, R0, MAR,
and MDR.
• The data bus allows data to be moved from the ACC to the MDR and then to the memory.
• The data bus allows data from the memory to be moved to the MDR and then to the ACC
for the next operations.
• These two data movement routes allow data to be moved between the calculation centre
of the programmable computer and the data storage centre.
• The MAR is connected to the ACC, allowing the address of memory operations to be
controlled by the results of operations.
• The MAR is connected to the IR, allowing the address of memory operations to be
controlled directly by instructions.
The Controller plays an important role in signalling the various components to operate
meaningfully. The Controller works according to the instruction that is read into the IR. If
the instruction is add, then the Control Unit sends appropriate signals to the ALU and other
registers to perform an add operation.
Instructions for Interacting with the Memory System
Our programmable computer is getting into shape. For the computer to operate with the
memory system, we need instructions for moving data between the Registers and the
Memory.
Two instructions are designed for the task: load and store. The load instruction moves data from an address location in the memory to the ACC. The store instruction moves the data in the ACC to an address location in the memory.
Instruction 32-bit Binary Representation
Add 0000 0000 0000 0001 0000 0000 0000 0000
Subtract 0000 0000 0000 0010 0000 0000 0000 0000
Negate 0000 0000 0000 0011 0000 0000 0000 0000
Load 0000 0000 0000 0100 <16-bit memory address operand>
Store 0000 0000 0000 0101 <16-bit memory address operand>

The load and store instructions include a parameter that specifies the memory address to load
or to store.
• The parameter, or the operand, is part of the instruction, taking up the last 16 bits of the 32-bit binary representation.
• The instruction representation is designed this way to save space.
• These two 32-bit instructions contain both the opcode and the operand.
The following figure illustrates the steps to execute a load instruction.
We begin with the load instruction already loaded into the IR.
• The instruction stored in the IR contains an opcode and an address operand.
• The opcode part is checked by the Control Unit and understood to be a load instruction.
• The Control Unit signals the IR to send the address operand to the MAR, and the Control
Unit sets the R/W line of the memory system to Read
• The Memory system carries out the operation of reading a data from the prescribed
address. The data is sent to the MDR.
• The Control Unit signals the MDR to move the data to the ACC.
6. Von Neumann Architecture
We will return to the problem of the form of programs that we have not yet addressed. The
instructions in a program must be codified before they can be passed to the IR and the ALU.
The original form of the program, however, can vary quite a lot. The following shows some
early examples:
• Hardwired. Not programmable.
• Punched film stock (Zuse Z3 in 1941).
• Rewiring to achieve partial programmability (Colossus in 1943 and ENIAC in 1944).
• Punched paper tape (Harvard Mark I IBM ASCC in 1944).
• Function table ROM (ENIAC in 1948).
The above designs show certain characteristics about the approach of handling programs in computers.
• Programs should be changeable so as to alter computer operations.
• Programs should be able to be stored away for later and repeated use.
• Programs should be readily accessible to the processors of the computer. The speed of reading the programs should be fast.
The Von Neumann architecture specifies that the program and data will be stored together in the Memory System. It allows the flexibility to re-program a computer through manipulating the program stored in memory.
• Programs can be easily changed through modifying the memory electronically.
• Programs are readily accessible by electronic signals.
• Programs may be stored indefinitely in memory.
• Stored programs can modify themselves in operation, because program code is simply data in memory cells.
• Stored programs make the likes of compilers and interpreters possible. The purpose of these programs is to produce other programs.
Program code is now stored in memory and the system bus supports a data movement route
to move instructions from memory to the IR. The Control Unit can take the following steps
to move an instruction to the IR.
• Control Unit sends the address containing the next instruction to MAR
• Memory system retrieves the data of the address and sends the data to the MDR
• Control Unit moves the instruction from MDR to IR.
The Program Counter (PC)
The execution of a program is done sequentially. After a computer has executed one instruction, it will execute the next one. The above design must be revised so that the computer knows the address of the next instruction to be executed. The following figure shows our revised design.
The difference is the addition of a new register called program counter.
• The program counter stores the address of the next instruction to be executed.
• It is connected to the data bus so that the stored address can be moved to the MAR for a
memory operation to read the next instruction.
• The program counter will increase by one upon receiving a signal from the Control Unit.
• The program counter may also be reset to 0, which is the first address of the Memory
System.
The Control Unit now takes the following steps to move an instruction to the IR.
• Control Unit moves the address in the PC to MAR
• Memory system retrieves the data of the address and sends the data to the MDR
• Control Unit moves the instruction from MDR to IR
• Control Unit signals the PC to increase by one
7. System Bus for Connecting the Registers
The registers in the programmable computers are all connected together with the system bus.
• The system bus is the most important highway for data movement in the computer.
• The system bus allows a pair of the registers to move data between them.
• At any one moment, only one such data movement route can operate.
• This is a limitation of the system bus design.
• Although a system bus can connect many registers, only two of them can exchange data
at any one time.
Data movement can be sped up by having movements happen in parallel. The system bus could be replaced by a fully connected network, in which each pair of registers has a dedicated highway.
There are a few drawbacks to this approach:
• The fully connected network is clearly a lot more costly to build
• A register can only handle one data movement at one time even if it is connected to all
other registers.
• Some connections have no use in the operations of the computer.
The performance of the system bus is an important factor to the working of the programmable
computer.
• The system bus operates according to the clock and signal from the Controller.
• The Controller determines which pair of components are to establish connection and to
move data.
8. Input and Output Controllers
The design of our basic programmable computer is completed by adding the input and output
controllers. These controllers are connected to the system bus.
The components that are connected to the input and output are considered peripherals.
• Input devices such as keyboard and mouse, and output devices such as monitor are
common.
• Some IO devices can operate both as input and output devices.
• A hard disk is a memory device that can do both input and output.
The programmable computer has two levels of memory. The memory system that connects directly to the system bus through the MAR and MDR is called the main memory or primary memory. The main memory stores data and programs for program execution. The memory system that connects through IO controllers is called secondary memory. Secondary memory is usually designed for long-term storage.
9. Central Processing Unit (CPU)
For ease of design, implementation, and production, some components of the programmable
computer are integrated closely to form a single component called the Central Processing
Unit (CPU).
The following figure shows that the CPU includes the ALU, the main registers, the system
bus and the controller.
Leaving the main memory out of the CPU has advantages:
• Memory is physically large; including it in the CPU would make the design very difficult.
• Leaving the memory as a separate component allows the memory size to expand independently.
• The CPU can operate more independently from the Memory System. It allows these two
components to operate at different speeds.
It also has an important disadvantage:
• The data movement route from the registers to the Memory System becomes longer and
data movement will take longer time to complete.
10. Resolving the Differences in Operating Speeds
Components of a computer system operate at different speeds. Some are designed that way and others are constrained by external factors.
• ALU speed is decided by its designed clock speed.
• Main memory speed is decided by physical characteristics of the memory system.
• Secondary memory speed is also decided by the design, and it is of a slower data transfer
speed (operating speed) than the main memory.
• Input speed is partly decided by the input device and partly decided by the user.
• The output device determines the output speed.
• The system bus decides data transfer speed between components.
Coordination is required to allow two or more components to communicate and work together. There are rules that the components follow when they communicate, and this often involves one component waiting while the other components are doing their work.
Appendix A. Amdahl's Law
In this appendix, we will discuss how to estimate the performance or speed up of a computer
system from its components with Amdahl's Law.
The overall speed of a computer system is limited by its slowest components in the chain of
operation.
• For example, a piece of data to be moved from the secondary memory to the primary
memory and then executed by the CPU.
• This chain of operation will consist of vastly different speed of operation.
• The secondary memory is the slowest and therefore the speed of this operation is limited
by the speed of the secondary memory.
A computer system will carry out many operations.
• Moving data from the secondary memory and executed by the CPU is only one of them.
• Another operation may be simply moving data from the primary memory to the CPU and
executed there.
If a computer system has only these two operations, then the overall speed of the computer system depends on the frequencies of the two operations and the individual speeds of the operations.
Clearly if one operation is very slow, for example the first operation that involves the
secondary memory, the overall speed is effectively determined by the first operation.
Many people are hoping for a computer speed-up by replacing a slower component with a
faster component. The CPU is often the target of a computer upgrade such as replacing a
CPU with faster clock rate. Is replacing with a faster CPU an effective method to speed up a
computer system?
speed_up = time_before_enhancement / time_after_enhancement      (1)

speed_up_overall = 1 / ((1 - f) + (f / s))                       (2)

maximum speed_up = 1 / (1 - f)                                   (3)

Let the speed-up of a single enhancement be defined by eq (1).
For a computer system with many components, the speed-up of one component affects the overall speed-up according to Amdahl's law, eq (2).
The variable f stands for the fraction of a computation operation being enhanced. For
example, if CPU is upgraded, then f is the fraction of operation that involves the CPU. The
variable s stands for the speedup rate in the upgrade.
The law has a couple of consequences:
• If f is small, then the speedup is not so useful.
• Even if s is very large, the overall speedup is still bound by eq (3).
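Amdahl's law is easy to put into code. The following C sketch evaluates eq (2), and the bound of eq (3), for a hypothetical upgrade; the numbers are illustrative only:

#include <stdio.h>

/* Overall speed-up by Amdahl's law: f is the fraction of the work that is
   enhanced, and s is the speed-up rate of that fraction, as in eq (2). */
double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    /* Speeding up a component used 25% of the time by a factor of 10
       gives only a modest overall gain, bounded by 1/(1-f) as in eq (3). */
    printf("f=0.25, s=10   : %.3f\n", amdahl(0.25, 10.0));   /* 1.290 */
    printf("f=0.25, s->inf : %.3f\n", 1.0 / (1.0 - 0.25));   /* 1.333 */
    return 0;
}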
Chapter 5. Case Study: Little Man's Computer
This chapter will discuss the design and evaluation of a programmable computer through
implementation of an emulator.
We have so far come up with the following programmable computer design.
In this chapter we will construct an operational computer based on the above design.
The computer is called the Little Man's Computer (LMC). There are several significant deviations from the previous design that allow us to look into other issues. LMC comes with an instruction set, allowing the writing of LMC programs.
1. Design of the LMC

The LMC is simpler than the above design of programmable computer.
• LMC has one data register (ACC). There are no other data registers (so R0 above is not found in LMC).
• One operand of an addition/subtraction operation must be the ACC, and the other comes from the memory system through the MDR.
• LMC is a decimal-based computer. The data and instructions follow positive decimal representation. For consistency, the ALU, registers, and memory system are designed to handle decimal numbers.
The following figure shows the design of the LMC.
The following lists the major features of the LMC:
• Three-digit decimal representation is used in data and instructions.
• The registers can hold 3-digit positive decimal.
• The ALU supports the operations of addition and subtraction. Overflows are not reported
and the carry is simply discarded.
• The ALU sends out two types of status: whether the result is zero and whether the result
is positive.
• The memory system has 100 addresses, from 00 to 99. Each address can store a 3-digit
positive decimal number.
LMC as an analogy of a programmable computer
LMC is a common teaching tool used by many universities in teaching computer architecture.
The original LMC is described in a fictional manner. This is not the approach adopted by this course. However, the following describes LMC in the original manner for your interest.
The following lists the major components of the LMC, and the corresponding components in
a real computer.
The Little Man Computer          The real computer              Role
Calculator                       Arithmetic and logical unit    Performs arithmetic and logical operations such as addition and subtraction.
Little man                       Control unit                   Controls the steps (when and where) to load the data from memory into the arithmetic and logical unit (the calculator in the Little Man Computer).
Mailboxes                        Memory                         Used to store instructions and data. Each mailbox has a label from 00 to 99. Mailbox number 00 is equivalent to memory address 00.
Instruction location counter     Program counter                Used to keep track of which program line is being executed.
In-basket                        Input controller and buffer    Used to receive data from the outside world into the computer.
Out-basket                       Output controller              Used to send data from the computer to the outside world.
The following figure shows the components in the LMC in an illustrative style.
The Little Man is hidden inside a room where there are a few specific connections to the
outside world only.
Notes about the various components:
• There are 100 mailboxes, each with an address from 00 to 99. Each mailbox address is
therefore represented with two digits. Each mailbox can hold a three-digit decimal
number, which is the content of a mailbox.
• The calculator is available for doing simple arithmetic and storing data temporarily. The
display on the calculator is 3 digits wide.
• The location counter is a hand counter for the little man to keep track of his work. The counter keeps a two-digit number (from 00 to 99). The counter has a reset button outside the room, allowing an external instruction to reset the counter.
• Other than the reset button, the only other connections to the outside world are the in-basket and the out-basket.
• There is of course the Little Man who will perform tasks that will be described later.
A user can communicate with the Little Man by placing a 3-digit data item in the in-basket; however, it is up to the Little Man to read it at a particular time. The Little Man can also leave a 3-digit data item in the out-basket.
No other form of communication with the Little Man is possible.
2. LMC Instruction Set
The instruction set contains instructions that can be used to compose programs. The
instruction set determines the richness and variety of programs that a programmable computer
can support.
There are several issues that need to be considered in the design of the LMC instruction set.
The types of instructions
The kinds of instructions to include in the instruction set depend on the intended functions of the LMC. At a minimum, the instructions should allow the writing of complete programs. They will include instructions for moving data from and to the memory, input and output, arithmetic and logic operations, and stopping the program.
The format of instructions
LMC is based on the von Neumann architecture, and programs are stored in memory. The instructions must therefore take the same form as the mailbox contents, which are 3-digit data units.
There are at least two alternatives:
• Use a single 3-digit data unit for all instructions. Assign the first digit as the operation
code, and reserve the next two digits as the operand.
o Advantage: economical use of memory because all instructions are one data unit long.
o Disadvantage: the number of possible instructions is small. Having one digit for the opcode implies that there are at most 10 different instructions. The 2-digit operand is, however, sufficient for covering the 100 addresses.
• Use two 3-digit data units for all instructions. The first data unit is designated as
operation code. The next data unit is reserved for an operand.
o Advantage: a 3-digit number has sufficient combinations of opcode for assigning a lot
of instructions.
o Disadvantage: costly in terms of storage because each instruction takes up 2
addresses. In a computer with only 100 addresses it is quite expensive.
The LMC has adopted the first alternative. Each instruction has two parts: 1-digit opcode and
2-digit operand.
The following lists the LMC instruction set. The XX indicates the position of the operand for
the instruction.
Instruction Code Remarks
Load 5XX Load from mailbox to calculator
Store 3XX Store in mailbox from calculator
Add 1XX Add from mailbox to calculator
Subtract 2XX Subtract from calculator the mailbox value.
Input 901 Input
Output 902 Output
Halt 000 Coffee Break or Halt
Branch 6XX Branch unconditionally
Branch if Zero 7XX Branch if zero
Branch if Positive or Zero 8XX Branch if positive or zero
Data A location for data storage
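Because an LMC instruction is just a 3-digit number, the opcode and the operand can be separated with integer division and remainder. A minimal C sketch of this decoding step:

#include <stdio.h>

int main(void) {
    int instruction = 512;             /* LDA 12 */
    int opcode  = instruction / 100;   /* first digit: 5 */
    int operand = instruction % 100;   /* last two digits: 12 */
    printf("opcode=%d operand=%02d\n", opcode, operand);
    return 0;
}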
LOAD and STORE instructions
The first instructions to design are the LOAD and STORE instructions that move data
between the Memory and the ACC.
Instruction LOAD
Opcode 5
Instruction Format 5 XX (XX is the address to load to the ACC)
Example 512 is an instruction to cause the data in memory address 12 to be copied to the ACC
• The little man goes to the mailbox and retrieves the value in the specified address.
• The little man then enters the value in the calculator.
• The previous value in the calculator is therefore overwritten.
• The data in the specific address of the mailbox remains the same.
Instruction STORE
Opcode 3
Instruction Format 3 XX (XX is the address to store the data in ACC)
Example 312 is an instruction to cause the data in the ACC to be copied to memory address 12
• The little man goes to the calculator and retrieves the value there.
• The little man then places the value in the mailbox at the specified address.
• The previous value in the mailbox is therefore overwritten.
• The data in the ACC remains intact.
ADD and SUBTRACT instructions
These two instructions allow LMC programs to carry out arithmetic operations.
Instruction ADD
Opcode 1
Instruction Format 1 XX (XX is the address containing the second operand)
Result ACC will store the sum of the ACC and the data in memory address XX
Example 120 is an instruction to sum the ACC and the data in address 20 and the result is
stored in ACC
• The little man goes to the mailbox and retrieves the value in the specified address.
• The little man then adds the value to the value already stored in the calculator. The result
of the addition is stored in the calculator.
Instruction SUBTRACT
Opcode 2
Instruction Format 2 XX (XX is the address containing the second operand)
Result ACC will store the difference between ACC and the data in memory address XX
Example 220 is an instruction to subtract the data in address 20 from the ACC and the result is
stored in ACC
• The little man goes to the mailbox and retrieves the value in the specified address.
• The little man then subtracts the mailbox value from the value already stored in the
calculator. The result of the subtraction is stored in the calculator.
• It is possible to end up with a negative value in the calculator. Negative values are
allowed in the calculator, but not in any other components in the LMC.
IN and OUT instructions
These two instructions allow LMC programs to input and output data.
Instruction IN
Opcode 901
Instruction Format 901 (this instruction is an exception because it has no operand)
Result The data in the input buffer is copied to the ACC
• The little man goes to the in-basket and picks up a value there.
• The little man then moves to the calculator and enters the value.
• The previous value in the calculator is therefore overwritten.
• It is possible to have multiple values left in the in-basket. These values are picked up by the little man on a first-come-first-served basis.
Instruction OUT
Opcode 902
Instruction Format 902 (this instruction is an exception because it has no operand)
Result The data in the ACC is sent to the output
• The little man goes to the calculator and retrieves the value there.
• The little man goes to the out-basket and places the calculator value there.
• The value in the calculator remains there.
• It is possible to have multiple values placed in the out-basket. These values preserve their
original order when the users are receiving them.
HLT and COB instructions
These two instructions have the same effects. They stop the LMC computer.
Instruction HLT or COB
Opcode 0 (the operand part is ignored)
Result The LMC is stopped
Without this instruction, an LMC program would run indefinitely.
BRANCH instructions
So far LMC programs must be executed in sequence. The instruction location counter (PC) always increases by one after the completion of an instruction. The following branch instructions, however, allow the instruction location counter to be changed, so that the next instruction to execute can come from another address.
Unconditional branch instructions force the execution to move to another address. Conditional branch instructions move the execution to another address only if a certain condition is met.
Instruction UNCONDITIONAL BRANCH
Opcode 6
Instruction Format 6 XX (XX is the destination address for the branch)
Result PC will store the destination address, so the next instruction to be executed is the one at XX
Example 630 is an instruction to cause the execution to branch to address 30.
• The little man goes to the instruction location counter and stores the address part of the instruction there.
• After the completion of this instruction, the next instruction will be retrieved from the
address stored in the instruction location counter. The Little Man expects a valid
instruction stored in the address there.
• For example, the instruction 6 2 3 means that the value 23 is stored in the instruction
location counter. The next instruction to execute is the content stored in the mailbox
address 23.
Instruction BRANCH ON ZERO
Opcode 7
Instruction Format 7 XX (XX is the destination address for the branch)
Result If the ACC is zero (that is, the previous operation's result is zero), the PC will store the destination address, so the next instruction to be executed is the one at XX
• The little man goes to the calculator and checks the value stored in the calculator to see if
it is zero.
• If the value is zero, then the little man goes to the instruction location counter and stores the address part of the instruction there.
• If the value is not zero, then the instruction is complete and nothing more is done.
Instruction BRANCH ON POSITIVE
Opcode 8
Instruction Format 8 XX (XX is the destination address for the branch)
Result If the ACC is zero or positive (that is, the previous operation's result is zero or positive), the PC will store the destination address, so the next instruction to be executed is the one at XX
• The little man goes to the calculator and checks the value stored in the calculator.
• If the value is zero or positive, then the little man goes to the instruction location counter and stores the address part of the instruction there.
• If the value is negative, then the instruction is complete and nothing more is done.
3. Example LMC Programs
Input and Print
The following program reads a number from the input, and then prints the number to the
output. The program finally stops.
Example: Input and Print
It is important to note that the first address of an LMC program is always zero. This is where LMC execution starts after a reset.
00 901 ; IN
01 902 ; OUT
02 000 ; HALT
Adding Two Numbers
In the following program, two input values are summed and the result is printed. Note that the first input value is stored at address 99, while the second can stay in the ACC. For the addition operation to happen, one operand must come from the ACC and the other from the memory.
Example: Adding two numbers
00 901 ; INPUT #1
01 399 ; STORE IN ADDR 99
02 901 ; INPUT #2
03 199 ; ADD ADDR 99
04 902 ; OUTPUT
05 000 ; STOP
99 ; DATA
Comparing Two Numbers
The following compares two numbers and prints the larger number.
Example: Comparing two numbers
00 901 ; INPUT #1
01 311 ; STORE IN ADDR 11
02 901 ; INPUT #2
03 312 ; STORE IN ADDR 12
04 211 ; #2 - #1
05 808 ; BRANCH IF POSITIVE TO 08
06 511 ; LOAD #1
07 609 ; BRANCH TO 09
08 512 ; LOAD #2
09 902 ; OUTPUT
10 000 ; STOP
11 000 ; DATA
12 000 ; DATA
Program Loader
Program loader is an important component of an operating systems. If we were to develop an
operating system for LMC, this would be the first program needed.
Example: Program Loader
00 BR 50     ; jump to the loader at address 50
50 IN        ; read the next instruction word
51 STO 00    ; store it; this instruction's operand is self-modified (00, 01, 02, ...)
52 BRZ 57    ; if the input is 000, branch to 57 to begin execution
53 LDA 51    ; load the STO instruction at address 51
54 ADD 99    ; add 1 to its address operand
55 STO 51    ; write it back, so the next input goes to the next address
56 BR 50     ; loop for the next input
57 LDA 51    ; load the final STO instruction (its operand points at the terminating 000)
58 STO 60    ; plant that STO instruction at address 60
59 LDA 98    ; load 650 (BR 50)
60 STO 00    ; (planted STO) overwrite the terminating 000 with BR 50
61 LDA 97    ; load 300 (STO 00)
62 STO 51    ; reset the loader for the next use
63 BR 00     ; branch to the loaded program
97 DAT 300
98 DAT 650
99 DAT 01
4. Program the LMC
To program the LMC, a programmer can follow the steps below:
• Directly enter the instructions as 3-digit numbers into memory locations. The first
instruction should be at address 0.
• The programmer presses the Reset button to start. The Reset button sets the Program
Counter to zero, and this is where the next instruction is executed.
Mnemonics are abbreviations for the instructions that make an LMC program easier to read. A program is often supplemented with comments to help readers understand the purpose of each instruction.
The following table shows the instruction set of LMC with mnemonics.
Mnemonics   Code   Remarks
LDA 5XX Load from mailbox to calculator
STO 3XX Store in mailbox from calculator
ADD 1XX Add from mailbox to calculator
SUB 2XX Subtract from calculator the mailbox value
IN 901 Input
OUT 902 Output
COB or HLT 000 Coffee Break or Halt
BR 6XX Branch unconditionally
BRZ 7XX Branch if zero
BRP 8XX Branch if positive or zero
DAT A location for data storage
To program the LMC with mnemonics, a programmer can follow the steps below:
• Write LMC program using mnemonics using an editor.
• Use a program known as an assembler to assemble the 3-digit instructions from
mnemonics. The assembler will ensure that the first instruction is at address zero.
• The programmer presses the Reset button to start. The Reset button sets the Program
Counter to zero, and this is where the next instruction is executed.
Modern day computer programmers seldom write programs in assembly language. High-level programming languages such as C and Java are used instead. Programmers use compilers and linkers to convert a program written in a high-level language into instructions that can be executed by the CPU.
Conversion of a Selection Statement
The following shows a simple program written in C.
Example: Compiling a C program into LMC
The following is a C program and it is compiled into a LMC program as below.
#include <stdio.h>

int main(void) {
    int x;
    scanf("%d", &x);
    if (x > 5)
        printf("0");
    else
        printf("1");
    return 0;
}
In LMC
00 IN ; Input data
01 STO 99 ; Store in 99
02 SUB 98 ; Subtract constant 5 (x - 5)
03 BRP 06 ; Branch if x>=5
04 LDA 97 ; Load '1'
05 BR 08
06 BRZ 04 ; Branch if x==5, output '1'
07 LDA 96 ; Load '0'
08 OUT ; Print
09 HLT
96 DAT 00 ; Constant 0
97 DAT 01 ; Constant 1
98 DAT 05 ; Constant 5
99 DAT
Conversion of a Repetition Statement

Example: Compiling a C program into LMC
The following shows a C program with a loop and it is compiled into a LMC program as
below.
#include <stdio.h>

int main(void) {
    int x = 0;
    while (x < 10) {
        printf("%d", x);
        x++;
    }
    return 0;
}
In LMC
00 LDA 99
01 SUB 98 ; Calculate X - 10
02 BRZ 08 ; Jump to after the loop
03 LDA 99 ; Load X
04 OUT ; Print X
05 ADD 97
06 STO 99 ; X = X + 1
07 BR 00 ; Jump to the top of loop
08 HLT
97 DAT 01 ; Constant 1
98 DAT 10 ; Constant 10
99 DAT 0 ; Variable X
5. Benefits and Hazards of Von Neumann Architecture
In a stored program architecture computer such as the LMC, both data and program exist in the memory system.
This design has a number of benefits:
• Simpler computer design. A single memory system is needed for both data and program instructions. Otherwise, separate memory systems would be needed, and each would need its own input/output devices.
• Allows a programmer to write instructions that modify or create other instructions. This
could reduce program size and improve programmability. The operating system example
for LMC below would not be possible without self-modifying instructions.
However, this can cause a problem.
• There is no indicator in an individual address to signify whether it is an instruction or
data.
• LMC has no way to predict whether an individual address contains an instruction or data. The LMC can only assume it to be a valid instruction.
• LMC can find out that an address contains an invalid instruction only after reading it in. For example, 903 is not a valid instruction. The LMC cannot execute this and will possibly cause a system crash or exception.
• It is also possible that a data value happens to be the same as a valid instruction, causing the LMC to perform unpredictable actions.
• Only the programmer has this knowledge, and so it is up to the programmer to take care.
When a programmer writes programs using mnemonics, one can use the DAT mnemonic to store any constant in an address. The DAT signifies that the programmer does not expect this mailbox to be fetched and executed by the Little Man. However, the LMC can still execute the data if the data is a valid instruction.
Example: A Simple Operating System for LMC
The following program can read in a new LMC program from the input and execute the
new program. It involves storing the new program from address 0 and then branch to the
new program.
00 BR 50
50 IN
51 STO 00 ; initially 300; this instruction's operand is self-modified
52 BRZ 57
53 LDA 51
54 ADD 99
55 STO 51
56 BR 50
57 LDA 51
58 STO 60
59 LDA 98
60 STO 00
61 LDA 97
62 STO 51
63 BR 00
97 DAT 300
98 DAT 650
99 DAT 01
The Von Neumann bottleneck refers to the limited throughput between the CPU and the memory.
• The connection between the CPU and the memory carries both data and program instructions. Program instructions and data cannot be accessed at the same time.
• Often instructions cannot be executed because the data has not yet been read into the CPU.
6. Execution of LMC Instructions and Micro-Operations
LMC is designed to follow a cycle of instruction execution as discussed in the previous
chapter.
• In each cycle, the execution of one instruction is completed.
• There are two parts in the cycle. The first part is fixed regardless of the instruction type.
The second part depends on the instruction type.
The first part of the cycle is shown in the following figure.
The aim is to move the next instruction to the IR.
• (1 PC) Move the address from the PC to the MAR and ask the Memory to load the next instruction.
• (2 Inst) The Memory System retrieves the required data and sends it to the MDR.
• (3 Inst) The Control Unit moves the data from the MDR to the IR.
The second part of the cycle depends on the instruction. Assume that the instruction is LDA.
The execution of the instruction is shown in the following figure.
• (1 Check) The Control Unit checks the IR and then acts according to the instruction.
• (2 Addr) The Control Unit moves the address part of the instruction from the IR to the MAR.
• (3 Data) The Memory System retrieves the required data and sends it to the MDR.
• (4 Data) The Control Unit moves the data from the MDR to the ACC.
The execution of an LMC instruction involves a number of operations. These operations are
called micro-operations. Many of these micro-operations involve moving data between
components, especially the registers. This is a data movement perspective of computer
operation.
The operation of the programmable computer essentially boils down to moving data between components. The computer will execute instructions faster if the data movement is faster. The following lists the common data movement patterns involved in LMC instruction execution:
• PC to MAR (the address of next instruction)
• Memory System to MDR (memory read data)
• MDR to Memory System (memory write data)
• MDR to IR (the current instruction)
• MDR to ACC (data)
• ACC to MDR (data)
• MDR to MAR (memory address)
The Control Unit (CU) is the coordinator of computer operations. It sends signals to various
components so that micro-operations can be carried out meaningfully. The following
illustrates the step-by-step actions of the CU taken to execute a LDA instruction.
• Control Unit moves the address in the PC to MAR
• Memory system retrieves the data of the address and sends the data to the MDR
• Control Unit moves the instruction from MDR to IR. The instruction stored in the IR
contains an opcode and an address operand.
• Control Unit signals the PC to increase by one
• The opcode part is checked by the Control Unit and understood to be a load instruction.
• The Control Unit signals the IR to send the address operand to the MAR, and the Control
Unit sets the R/W line of the memory system to Read
• The Memory system carries out the operation of reading a data from the prescribed
address. The data is sent to the MDR.
• The Control Unit signals the MDR to move the data to the ACC.
The first four steps are common to all instructions, while the remaining steps are specific to the instruction being executed. After the last step, the computer operation returns to the first step to execute the next instruction. Computer operations are carried out in an unceasing cycle.

7. Fetch and Execution Cycle

Computer operations are carried out in an unceasing cycle of micro-operations execution that
is coordinated by the Control Unit.
This cycle is known as the fetch and execution cycle.
• This cycle repeats indefinitely until the computer is halted.
• The first part is known as the fetch part. The purpose is to fetch the next instruction into
the IR. The specific micro-operations of this part are always the same.
• The next part is known as the execution part, which carries out different actions
according to the specific instruction being executed.
The following figure illustrates graphically the fetch and execution cycle.

In the fetch and execution cycle, each step in the cycle is a micro-operation.
The rudimentary micro-operations performed by the CPU include the following:
• Invoke a function on the memory system
o Fetch data from a specific memory location
o Store data to a specific memory location
• Invoke a function on the program counter (add one)
• Invoke a function on the ALU
o Carry out an arithmetic or logic operation
• Transfer data from one register to another register.
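To make the cycle concrete, the following Python sketch simulates the fetch part and the execution part for the LMC instruction set. It is an illustrative model only; the function name fetch_execute and the data structures are inventions of this sketch, not part of the LMC specification.

# A minimal simulation of the LMC fetch and execution cycle (illustrative only).
def fetch_execute(memory, inbox):
    pc, acc = 0, 0
    outbox = []
    while True:
        # Fetch part: the same micro-operations for every instruction.
        mar = pc                            # PC -> MAR
        mdr = memory[mar]                   # M[MAR] -> MDR
        ir = mdr                            # MDR -> IR
        pc = pc + 1                         # PC + 1 -> PC
        opcode, addr = ir // 100, ir % 100
        # Execution part: depends on the opcode held in the IR.
        if ir == 0:                         # COB/HLT: stop the cycle
            return outbox
        elif opcode == 1:                   # ADD
            acc += memory[addr]
        elif opcode == 2:                   # SUB
            acc -= memory[addr]
        elif opcode == 3:                   # STO
            memory[addr] = acc
        elif opcode == 5:                   # LDA
            acc = memory[addr]
        elif opcode == 6:                   # BR
            pc = addr
        elif opcode == 7 and acc == 0:      # BRZ
            pc = addr
        elif opcode == 8 and acc >= 0:      # BRP
            pc = addr
        elif ir == 901:                     # IN
            acc = inbox.pop(0)
        elif ir == 902:                     # OUT
            outbox.append(acc)

# A small program that reads two numbers, adds them, and outputs the sum.
memory = [0] * 100
for address, word in enumerate([901, 310, 901, 311, 110, 902, 0]):
    memory[address] = word
print(fetch_execute(memory, [30, 12]))      # prints [42]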

8. Register Transfer Language (RTL)
The Register Transfer Language (RTL) provides a concise language for us to describe
micro-operations in our programmable computer.
• It details the data movement routes and patterns.
• It makes the number of steps and the time taken to execute an instruction more visible.
• It explains the steps in the fetch-execute instruction cycle.
The RTL uses different notations for different meanings:
• Capitalized names are registers or other components in the programmable computer. For
examples, ACC, R0, IR, and PC.
• Square brackets [ ] select one part of a register or the contents of a memory address. For example, IR[address] means the address field of the IR, and M[5] means the contents of memory location 5.
• The equals sign = indicates that the content of a memory address or register is a certain value. For example, M[5] = 3 means the content of memory location 5 is now assigned the value 3.
• The arrow -> indicates movement of data. For example, M[1] + M[2] -> M[R1] means the contents of memory locations 1 and 2 are added together and put into the memory location specified by R1.

Example 1: Fetch Instruction


The following RTL describes the fetching of an instruction and putting it into the instruction
register, IR.

Example: RTL
PC -> MAR Send instruction address to MAR
M[MAR] -> MDR Read the current instruction
MDR -> IR Copy the instruction to the IR
PC + 1 -> PC Point to the next instruction

Example 2: ADD Instruction
The following RTL describes steps of executing an instruction of ADD A, ACC, which adds
the data of a memory address A to ACC. The result is stored in ACC.

Example: RTL
PC -> MAR Send instruction address to MAR
M[MAR] -> MDR Read the current instruction
MDR -> IR Copy the instruction to the IR
PC + 1 -> PC Point to the next instruction

IR[address_field] -> MAR Send the operand address A to MAR
M[MAR] -> MDR Read the operand from memory
MDR + ACC -> ACC Perform the addition and put the result in ACC

The above assumes that A is part of the instruction and so it is available in the IR.
• The first four lines are the fetch phase.
• The Control Unit then copies the address operand in the IR and puts it in the MAR.
• The Memory System then sends the data of address A to MDR.
• Finally, the addition of MDR to ACC is carried out by the ALU.

Example: RTL
Question: Write down the RTL for an instruction ADD R0, 4. The instruction adds the
content of Memory Address 4 to R0.

PC -> MAR Send instruction address to MAR
M[MAR] -> MDR Read the current instruction
MDR -> IR Copy the instruction to the IR
PC + 1 -> PC Point to the next instruction

IR[address_field] -> MAR Send the operand address 4 to MAR
M[MAR] -> MDR Read the operand from memory
MDR + R0 -> R0 Perform the addition and put the result in R0

Assume that the address of the instruction ADD R0, 4 is 2000, the data in address 4 is 10, and the data in R0 is 30. The contents of the major registers after the execution of the instruction are shown below.

Register Data/Content
PC 2001
IR Holding the instruction code of ADD R0, 4
R0 40
MAR 4
MDR 10

Exercise: LMC program
Question: Given the following LMC program.

00 IN ; 901 Input the data


01 STO 10 ; 310 Store to location 10
02 IN ; 901 Input the data
03 STO 11 ; 311 Store to location 11
04 ADD 10 ; 110 Add with location 10
05 OUT ; 902 Output
06 BR 00 ; 600 Branch to 00
07 COB ; 000 End of Program
10 DAT 000 ; DATA
11 DAT 000 ; DATA

Assume that the PC is 03. Write down the steps in executing the instruction STO 11 with RTL.

Answer:
PC -> MAR
M[MAR] -> MDR
MDR -> IR
PC + 1 -> PC
IR[address] -> MAR
ACC -> MDR
MDR -> M[MAR]

Exercise: LMC program

Question: With the same LMC program, write down the steps in executing the instruction
BR 00 at address 06 with RTL.

Answer:
PC -> MAR
M[MAR] -> MDR
MDR -> IR
IR[address] -> PC

Exercise: LMC program

Question: The LMC is executing the instruction BR 00 at address 06. Write down the
content of MAR, MDR, IR, and PC after the execution.

Answer:
We can trace the RTL steps to see what values have been loaded into these registers.
MAR = 06
MDR = 600
IR = 600
PC = 00

Benefits of RTL

Studying the Register Transfer Language enables us to understand the effort involved in executing an instruction. The RTL reveals that some instructions require more effort than others. Generally, RTL can help us determine the following.
• Normally a clock controls the execution of the steps in RTL, so each step is completed in one clock cycle. RTL therefore allows us to estimate the clock cycles per instruction easily.
• With RTL, designers can investigate whether any steps can be carried out in parallel. Depending on the architecture of the CPU design, some CPUs support more than one bus between their components, so some steps may be carried out in parallel. In this case, the number of clock cycles for some instructions can be reduced.
• The RTL can specify the execution of every instruction in a procedural manner, which can be used in the implementation of the control unit of the CPU. The control unit is responsible for coordinating the data movement among the components, and it may be implemented as a microprogrammed controller driven by such step-by-step specifications.
If one step needs one clock cycle to complete, then the above examples show that some
instructions like STO would take 7 steps (cycles), and other instructions like BR would take 4
steps (cycles).

9. LMC Performance Analysis

The following table summarizes the theoretical number of cycles required for each LMC instruction.

Mnemonics Execution Cycles Memory Operations


LDA 7 2
STO 7 2
ADD 7 2
SUB 7 2
IN 5 1
OUT 5 1
COB or HLT 3 1
BR 4 1
BRZ 4 1
BRP 4 1

The theoretical figures are worked out based on the following assumptions:
• Each data movement between registers takes one cycle.
• Each memory system operation takes one cycle.
• Each input and output operation takes one cycle.
If the number of execution cycles of each LMC instruction is known, then the speed of LMC
program execution can be easily calculated.

Exercise: Execution Speed of LMC Programs

Question: You have written a LMC program. In one execution of the program, you
counted the number of instructions executed: there are 300 LDA or STO instructions, 120
ADD or SUB instructions, 20 BR, BRZ, or BRP instructions, and 5 IN or OUT
instructions. If the CPU clock rate is 100 MHz, calculate the time taken to execute the
program.

Answer:
Total number of cycles is calculated from the summation of number of cycles for each
instruction.
= 300 x 7 cycles + 120 x 7 cycles + 20 x 4 cycles + 5 x 5 cycles
= 3045 cycles
The clock rate is 100 MHz, which means 100 M cycles per second.
The time taken to execute the program is 3045 / 100 M = 0.00003045 seconds.
The running time of a program depends not only on the total number of instructions executed, but also on the composition of the instructions.
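This kind of calculation is easy to automate. The following Python sketch applies the cycle counts from the table above to any instruction mix; the names CYCLES and execution_time are invented for this illustration.

# Cycles per instruction, taken from the LMC performance table above.
CYCLES = {'LDA': 7, 'STO': 7, 'ADD': 7, 'SUB': 7,
          'IN': 5, 'OUT': 5, 'HLT': 3, 'BR': 4, 'BRZ': 4, 'BRP': 4}

def execution_time(instruction_counts, clock_rate_hz):
    # Total time in seconds for a given mix of executed instructions.
    total_cycles = sum(CYCLES[m] * n for m, n in instruction_counts.items())
    return total_cycles / clock_rate_hz

# The mix from the exercise above; any split of the 300 LDA/STO and the
# 120 ADD/SUB gives the same result because their cycle counts are equal.
mix = {'LDA': 150, 'STO': 150, 'ADD': 60, 'SUB': 60, 'BR': 20, 'IN': 3, 'OUT': 2}
print(execution_time(mix, 100e6))           # 3.045e-05 seconds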

Exercise: Execution Speed of LMC Programs

Question: Both Anders and Betsy have written a LMC program to find out the square of a
number.
Anders' program execution has involved 40 instructions, including 15 ADD/SUB, 20
LDA/STO, 3 BR/BRZ/BRP, and 2 IN/OUT.
Betsy's program execution has involved 42 instructions, including 14 ADD/SUB, 18 LDA/STO, 8 BR/BRZ/BRP, and 2 IN/OUT.
Which program is better in terms of execution speed?

Answer:
Total number of cycles is calculated from the summation of number of cycles for each
instruction.
Anders' program
= (15 + 20) x 7 cycles + 3 x 4 cycles + 2 x 5 cycles
= 267 cycles
Betsy's program
= (14 + 18) x 7 cycles + 8 x 4 cycles + 2 x 5 cycles
= 266 cycles
Betsy's program ran faster, even though it executed more instructions in total.

Cycles per instruction is a useful tool for evaluating the performance of a program. If the instructions take fewer cycles to execute, then program execution can be faster. This can be achieved by improving the computer design.
LMC is a useful abstraction of real computers. However, the cycles per instruction in a real computer is not simply the number of RTL steps. Here are the differences:
• Memory system normally takes longer than one CPU cycle to perform load/store.
• The program counter increment can occur at the same time as another micro-operation, so it does not require a cycle of its own.

COMPS266F Computer Architecture
Copyright  Andrew Kwok-Fai LUI 2017

Chapter 6. Technologies of Computer Components

This chapter will discuss the technologies for real computer systems. The discussion on the
LMC has concluded with four major components in computer operations.
• CPU or processor: executing instructions
• Bus: data movement for executing instructions
• Memory: data storage and retrieval
• IO: data input and output
This chapter will discuss the technologies developed for these four major components.

1. CPU Technologies and Manufacturing Process

CPU is the component in a computer system that executes instructions of a computer program. It typically consists of two core components: the arithmetic and logic unit (ALU) and the control unit (CU).
Depending on the current trend of CPU design, a CPU may also have other components such
as:
• Memory management unit (MMU): for controlling data transfer in and out of the CPU.
• IO control unit: for controlling peripheral devices.
• Cache memory: an internal fast memory structure for mirroring data in the main memory
system.
The CPU is a highly sophisticated electronic device based on complex circuitry. The physical appearance of a CPU is often a chip or a set of chips. A chip is a package containing an integrated circuit, which is an electronic circuit manufactured by depositing or diffusing chemicals on a thin substrate of a semiconductor material such as silicon.

Lithography describes the process of "printing" a circuit on a semiconductor substrate. CPU designers first design the complete circuit for the ALU and the CU on a computer. The circuit is then printed as an image on a CMOS wafer, which is a piece of circular and chemically treated semiconducting material. Usually, a wafer is large enough that many units of the CPU are printed and then cut out for packaging.
The current technology of integrated circuit manufacturing is based on photolithography, a
process that is similar to photocopying an image of circuitry onto the wafer.

The entire manufacturing process takes place in a highly controlled environment in special manufacturing plants. The duration of the process is typically around 2 months. Here are the common steps taken:
• Wafer preparation: crystallisation of pure silicon
• Wafer processing: printing circuitry onto silicon wafer
• Die cutting and attachment: units of CPU die are cut from wafer
• Chip packaging and testing

CPU cost is usually a significant part of the cost of a computer system. The cost of a CPU
chip depends on a few factors, and the most important ones are:
• Maturity of the manufacturing process.
• Size of the CPU chip.
• Raw materials.
• Competition in the market.
The most advanced manufacturing process in 2014 was the 14-nanometre process (14 nm). The figure roughly indicates the (half) distance between features in the printed circuitry.
One important attribute of a mature process is its yield. A very new and immature manufacturing process produces more defects in the wafers and the dies. The overall cost is therefore elevated to cover the loss caused by the defects. The following formula gives an estimation of the cost of a die.
DieCost = WaferCost / (Dies_per_wafer × Die_yield)

In 2009, processing a 300 mm wafer cost around US$2800, but a 150 mm wafer cost less than US$450 (Reference: GSA Wafer Fabrication Pricing Reports). The 300 mm and 150 mm figures are the diameters of the circular wafers. The number of dies per wafer can be estimated using the following formula.

Dies_per_wafer = π × (Wafer_diameter / 2)^2 / Die_area − π × Wafer_diameter / √(2 × Die_area)

Exercise: Dies per wafer

Question: The die size of an Intel Core i7 is 263 mm². Calculate how many dies can be cut from a 300 mm wafer.

Answer:
Dies_per_wafer = π × (300 / 2)^2 / 263 − π × 300 / √(2 × 263) = 268.7 − 41.1 = 227 dies

The die yield depends on the wafer yield (how many wafers are defective) and the defects per unit area in the manufacturing process. A typical range is 0.3 to 0.6 for new processes. A large die makes it more likely that a defect falls within the area occupied by the die.
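As a quick check of these formulas, the following Python sketch computes the dies per wafer and the resulting die cost. The function names are invented, and the yield of 0.5 used in the example is an assumed figure within the typical range quoted above.

import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    # First term: usable wafer area; second term: loss along the circular edge.
    return (math.pi * (wafer_diameter_mm / 2) ** 2 / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def die_cost(wafer_cost, wafer_diameter_mm, die_area_mm2, die_yield):
    return wafer_cost / (int(dies_per_wafer(wafer_diameter_mm, die_area_mm2)) * die_yield)

print(int(dies_per_wafer(300, 263)))             # 227, as in the exercise
print(round(die_cost(2800, 300, 263, 0.5), 2))   # about US$24.67 per good die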

Miniaturization and Performance


Using a smaller die size in CPU chip manufacturing can increase the die yield. However, the size must be sufficient for printing the entire circuitry. The trend is to scale down the circuitry, with methods such as optical reduction, so that the resulting die is smaller. The technological challenge is to place the features of the circuitry as close together as possible while the chip still operates reliably.
The semi-conductor manufacturing process has undergone relentless miniaturization since the
1970s. The following shows the major process stages.
Year 1971 1975 1982 1985 1989 1994 1995 1998 1999 2000 2002 2006 2008 2010 2012
Process (nm) 10000 3000 1500 1000 800 600 350 250 180 130 90 65 45 32 22

Miniaturization has a number of advantages:
• Improved die yield: smaller die size.
• Reduced power consumption: fewer electrons are needed to drive the circuitry.
• Increased potential maximum clock rate: signals have less distance to travel, so less time is needed.
The most direct way to improve CPU performance is to increase the clock rate. The
execution of instructions depends on a certain number of micro-operations, and making the
clock rate faster can reduce the time to complete the execution.
Clock rate cannot be increased indefinitely. The signals in the CPU require time to switch from one state to another, and this time depends on the current electronic technologies and physical laws. In addition, heat is dissipated in each state transition, so more heat is generated at a higher clock rate. If cooling is not sufficient, electronic devices will be damaged.

2. Bus Technologies

A bus is a data channel for transferring data from one device to one or more other devices.
The system bus, which connects the various registers in the CPU, is an example. There are
many buses in a computer system.
A bus consists of a number of lines, each of which serves one of the following four purposes:
• Data. A data line is binary encoded, and therefore it can carry one bit of data at a time.
• Addressing. An address line is binary encoded, and therefore it can carry one bit of data
at a time. The data on an address line represents an address.
• Control. A control line is also binary encoded. The data on a control line represents a
signal. For example, the control unit (CU) sends a signal to the program counter (PC) on
a control line (to invoke the increment function).
• Power. A power line supplies power at a particular stable voltage from the computer system.
Although the data sent on a bus is said to be binary encoded, there is usually a lower-level encoding scheme that codes the binary values 0 and 1 into another signal representation. For example, USB encodes binary data with the NRZI encoding scheme, which represents 0 and 1 by transitions between two signal states.
Bus Throughput
Bus throughput is the amount of data transfer on a bus per second. Bus throughput is often
called data rate or bandwidth. For example, USB 2.0 data rate is 480 Mbit per second.
Some buses, such as the front side bus (FSB) on a PC, are rated in terms of frequency. The frequency determines the period required to send one unit of data. The relation between frequency and period is given by the formula:
Frequency (Hz) = 1 / Period (s)

For example, a 500 MHz FSB means that the cycle period is 2 ns.
• A high throughput means moving more data in a particular time frame.
• If one data line in a bus can move a unit of data in a cycle, then theoretically a 32-line
data bus can move 32 units in a cycle.
• Basically there are two ways to achieve high throughput: increasing the transfer rate and
increasing the number of lines.

The above diagram assumes that one data line can transfer 1 bit of data per cycle.
The following diagram illustrates the benefits of higher data rate and a wider bus.

The following formula shows the theoretical throughput of a multi-line bus.

BusThroughput = NumberOfLines × DataRatePerLine

Exercise: Bus Throughput

Question: Given that each data line can complete the transfer of 1 bit in 200 ns. Calculate
the throughput if the bus has a total of 32 lines (a 32-bit bus)

Answer:
Data Rate per Line = 1 bit / 200 ns = 5 M bits / second
Bus Throughput = 32 x 5 M bits / second = 160 M bits / second = 20 M bytes / second

Some modern bus systems support multiple data transfers in one clock cycle. For example, AGTL+ allows 4 transfers per cycle.

Exercise: Bus Throughput

Question: Given that an AGTL+ is running on a clock rate of 100 MHz, and the bus is 64
bit. Calculate the throughput.

Answer:

Data Rate per Line = 100 M bits / second x 4 transfers = 400 M bits / second
Bus Throughput = 64 x 400 M bits / second = 25.6 G bits / second = 3.2 G bytes / second
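
Both exercises follow the same formula, which the following Python sketch captures; the function name is invented for this illustration.

def bus_throughput_bytes(num_lines, transfer_rate_hz, transfers_per_cycle=1):
    # Theoretical throughput of a parallel bus, in bytes per second.
    bits_per_second = num_lines * transfer_rate_hz * transfers_per_cycle
    return bits_per_second / 8

print(bus_throughput_bytes(32, 1 / 200e-9))      # 20000000.0 (20 M bytes / second)
print(bus_throughput_bytes(64, 100e6, 4))        # 3200000000.0 (3.2 G bytes / second)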

In general, bus throughput is dependent on the following factors:


• Data transfer rate
• Number of lines (or bits)
• Overhead of protocols (used in encoding data)
• Distance between connected devices
• Addressing and control
Bus Connectivity
Generally there are two types of buses: the point-to-point bus and the multi-point bus.
• A point-to-point bus carries data from a specific source to a specific destination. For example, the bus between the CPU and the Memory System, and the bus leading to a port for an external device, are point-to-point buses.
• A multi-point bus broadcasts data to everything connected to it. The system bus is one example. The PCI bus is another example, connecting many peripheral devices on a PC.
Although a multi-point bus connects many components, there is a limitation on the communication modes. Only one component is allowed to send data out at a time; one or more components can listen and receive data. Therefore, a single multi-point bus cannot allow two pairs of communication to happen together.

An expensive alternative is to fully connect all permutations of register pairs.

This allows data transfers to occur in parallel and reduces waiting time. However, each register can only handle one update (store operation) at a time.

Parallel Buses and Serial Buses


Parallel buses have more than one data line or data channel sending data at the same time. Parallel buses transfer more than 1 bit at a time, as opposed to serial buses. For the multiple data lines to be synchronized, a clock signal is usually sent on a separate control line.
It seems logical to assume that the throughput of parallel buses should be greater: there are more data lines transferring data simultaneously, so more bits can be sent at a time. However, parallel buses suffer from the following problems:
• Clock skew: the signals on different data lines arrive at slightly different times, perhaps due to differences in cable length or material. The faster the clock rate, the smaller the margin for error, so a small difference can cause errors.
• Crosstalk: the signals on neighbouring data lines may interfere with each other, causing errors in the signal. Shielding the cables can help, but it increases size and cost.
Parallel buses are particularly unsuitable for connecting devices separated by a long distance: the increased length makes the above problems more severe.
Serial buses have a single data line connecting two components, sending 1 bit at a time. Serial buses can work over longer distances.

Recently, serial buses have become the most common form of bus, even for short-distance communication. A serial bus running at a significantly faster clock rate can outperform a parallel bus, and at a cheaper price as well.

3. Review of Bus Technologies for PC

The Personal Computer (PC) is a class of desktop computers available at an affordable price for people to use at home and in offices. This section reviews the different types of buses found in generations of the PC.

Peripheral Component Interconnect (PCI)

The PCI bus (Peripheral Component Interconnect) is a standard for connecting a computer to
peripheral devices.
• PCI is a multi-point bus and a parallel bus between the IO controller hub (the south-
bridge) and PCI devices.
• Configuration: clock rate from 33MHz to 66MHz, with data width 32-bit or 64-bit, giving
throughput from 133 MB/s to 533 MB/s.
• PCI supports plug-and-play, and the device interrupt identifier is assigned by firmware
rather than using jumpers.
• PCI has a variant called PCI-eXtended (PCI-X), which runs on clock rate of 133MHz,
giving bandwidth of 1066 MB/s.

Peripheral Component Interconnect Express (PCI Express)

The PCI Express bus is a standard for connecting a computer to peripheral devices.
• PCI Express is a point-to-point bus and a serial bus. Data and signals are transferred on lanes.
• A link between 2 PCIe devices may operate on a different number of lanes, depending on the throughput needed. High-demand applications such as graphics can run on multiple PCIe lanes.
• Data transmission on a multi-lane connection is interleaved, so that successive bytes are transferred on different lanes.
• Data rate is around 250MB/s per lane, and a 16-lane connection is capable of around
4000MB/s.
• First-generation PCIe is constrained to a single signalling rate of 2.5 Gbit/s. The figure of 250 MB/s per lane follows from the physical signalling rate (2500 Mbaud) divided by the encoding overhead (10 bits/byte). This means a 16-lane (x16) PCIe card would then be theoretically capable of 250 x 16 = 4000 MB/s (3.7 GiB/s) in each direction.

Accelerated Graphics Port (AGP)

The AGP Port (Accelerated Graphics Port) is a bus for connecting video device to computer.
The point of connection is often the primary controller hub to the main memory and the CPU.
• AGP is a point-to-point bus and a parallel bus. It has already been superseded by the PCI Express bus.
• AGP comes in a variety of speeds and sizes:
o AGP 2x: 32-bit, 66MHz, double-pumped (two data transfers per clock cycle)
o AGP 4x: 32-bit, 66MHz, quad-pumped
o AGP 8x: 32-bit, 66MHz, eight transfers per clock cycle
o AGP 2x throughput: 4 bytes x 66MHz x 2 = 533MB/s

Industry Standard Architecture

The ISA bus (Industry Standard Architecture) is an old standard of computer bus on PC
connecting peripheral devices.
• ISA is a parallel and multi-point bus.
• Originally designed as 8-bit bus (at 4.77MHz), and subsequently upgraded to 16-bit (at
8MHz).
• The EISA improvement extends the bus further to 32 bits (at 8.33MHz) and allows more than one CPU to connect to the bus.
• ISA supports DMA and an early version of plug-and-play, which did not perform well.

Advanced Technology Attachment (ATA) and SATA


The ATA bus (Advanced Technology Attachment) is another conventional standard of computer bus on the PC. It connected the IO controller hub (south-bridge) to a hard disk. It is also called IDE, ATAPI, or UDMA.
• The IO controller of the ATA bus is situated on the hard disk itself, rather than on the motherboard. Apart from hard disks, many other devices connect to the computer using ATA, including CD-ROM drives, Zip disks, and tape drives.
• The speed of transfer was improved with DMA and Ultra DMA (UDMA), so that data can be written directly to memory without the intervention of the CPU. The speed of transfer depends on the generation of ATA. For example, UDMA 100 runs at 100 MB/s.
• This standard has been superseded by SATA (Serial ATA). SATA 150 runs at a signalling rate of 1.5 Gbit/s.
• Serial ATA (SATA or S-ATA) is a computer bus technology primarily designed for the transfer of data to and from a hard disk.
• It is the successor to ATA. The older technology was retroactively renamed Parallel ATA (PATA) to distinguish it from Serial ATA.

• First-generation Serial ATA interfaces, also known as SATA/150, run at a signalling rate of 1.5 Gbit/s. Because Serial ATA uses 8b/10b encoding with an efficiency of 80% at the physical layer, this results in an actual data transfer rate of 1.2 Gbit/s, or 150 megabytes per second.
• This transfer rate is only slightly higher than that provided by the fastest "Parallel ATA"
mode, Ultra ATA at 133 MB/s (UDMA/133).
• With the release of the NVIDIA nForce4 chipset in 2004, the signalling rate of SATA II was doubled to 3.0 Gbit/s, for a maximum throughput of 300 MB/s or 2.4 Gbit/s.

External Serial ATA (eSATA)


With External SATA, or eSATA (131MB/s), SATA devices can be connected outside the PC with shielded cables up to two metres long.
• Up to six times faster than contemporary external storage solutions: USB 2.0 and Firewire
(IEEE 1394).
• eSATA is, however, not faster than USB 3.0 (400MB/s) or Firewire S3200 (3200Mbit/s).

USB Bus
The USB bus has become the most popular means of peripheral connection.
• 1-bit serial
• Hot-pluggable
• USB devices are driven by the host computer
• USB 2.0 (60MB/s) and USB 3.0 (400MB/s), compared with Firewire 400 (400Mbit/s) and 800 (800Mbit/s)

Front-side Bus (FSB)


Front-side bus (FSB) connects the processor to the primary controller hub, which then connects to the main memory and the IO controller hub. In the von Neumann architecture, the FSB plays the key role of transferring both data and instructions from the main memory to the processor, so the performance of the front-side bus is most important.
• It is used in Intel processors such as Pentium, Celeron, and Core 2.
• It is a parallel bus, of which the width is normally 64-bits.
• The later versions can perform 2 or 4 data transfers per clock cycle. For example, the Pentium III FSB supports 1 data transfer per clock cycle but the Pentium 4 supports 4. Even if both run on a 100 MHz clock, the throughput increases from 800 MB/s to 3200 MB/s.
• The FSB is generally regarded as the performance bottleneck of the computers of that generation. The CPU cannot execute instructions before the instructions and data are read in through the FSB.
• Intel QuickPath Interconnect (QPI) and AMD HyperTransport now offer superior technology. For example, QPI provides high-speed serial data transfer between multiple components, similar to network communication. Multiple data channels allow two components (e.g. the CPU and memory) to separate input and output data flows.

4. Memory Technologies

The range of memory technologies for computers spans many dimensions: speed, cost, and other characteristics. While a computer designer could ignore the cost issue and choose the best memory technology, the market favours using the memory type that fits the purpose.
Data stored in a memory system is structured around a unit of data. The size of the basic unit of data varies from one type of memory to another: it could be 1 bit, 8 bits, 32 bits, 64 bits, and so on. This is often referred to as the word size.
Each data unit is uniquely identified by its address. The address is an essential parameter for the load/store operations of a memory system.

Classes and Hierarchy of Memory


Computer memory has a variety of classes, summarized along four dimensions below.
Volatility: Volatile (requires electric power to retain the stored values) vs. Non-volatile (retains its values without electric power).
CPU accessibility: Primary (directly accessible by the CPU) vs. Secondary (accessed by the CPU through indirect means of data transfer).
Mutability: Mutable (read and write allowed) vs. Immutable (read only).
Access restriction: Random access (any addressable unit can be directly accessed at a constant speed) vs. Sequential access (data must be retrieved in order).

Memory technologies can also be described with the following attributes:
• Cost: the cost of manufacturing and maintenance. Where consumables are involved (such as tapes for a tape drive), the cost of consumables should be included. It may be measured in dollars per Mbyte.

• Compactness: the space occupied by memory can be an important consideration when it
is integrated with other computer components. Usually the smaller the better.
• Throughput or data transfer rate: the amount of data transferred per unit of time. Usually measured in the same way as throughput in buses (Mbytes per second).
• Latency: the time taken for a memory system to begin a data transfer. Some memory systems (such as CD-ROMs, hard disks or tape drives) require a setup time after receiving instructions to perform read/write operations. Other memory systems purposely add latency in order to achieve a higher throughput (such as DDR RAM).
The following shows a typical memory hierarchy of a desktop computer. There are various types of memory, each with different characteristics and purposes.

Memory in Processors
• Registers in processors are very fast, compact, and mutable memory. Their operating speed must be fast enough to match the internal processor clock rate, and they should be compact enough to fit into the physical package of the processor. Memory technology suitable for this purpose is costly and volatile.
• Cache memory in processors is used to mirror part of the main memory. If the required data and instructions are already in the processor, then access to the main memory through the front-side bus can be avoided. Cache memory should also be fast, compact, and mutable; its cost limits the amount of cache memory that can be included.
Primary Memory
• Primary memory is the memory system that is directly addressable by the processor. In other words, data in the primary memory can be directly referenced by instructions.
• Primary memory is often known as the Main Memory system. The Main Memory should be large, less costly, and mutable. A large amount of Main Memory is critical for the execution of programs, especially in multi-programming systems. The Main Memory is often not physically contained within the processor, so compactness is not a major concern.
• An IO cache and buffer is sometimes part of the Main Memory. They can make IO operations more efficient.
Secondary Memory
• Secondary memory provides long-term data and program storage.
• The demand for capacity is higher given the larger total amount of data handled by a
computer system.
• It is often non-volatile and low-cost. The available technologies for secondary memory
are slower and latency is often not a concern.
• The media for storage determines whether the secondary memory device is mutable.

The Main Memory

The Main Memory is the memory system that feeds the processor with instructions and data.
It works closely with the processor in the fetch and execution cycle.
• In the fetch phase, the processor needs to load the next instruction from the Main Memory.
• In the execution phase, the processor may execute an instruction that involves data from the Main Memory.

The Memory Address Register (MAR) and the Memory Data Register (MDR) form the
interface between the Main Memory and the CPU.
• The MAR specifies the address of the memory required.
• The MDR holds the data for the transaction.
The MAR and the MDR are connected to the Main Memory in the following manner.
• The MAR holds the address in 8 bits or a multiple of 8 (depending on the addressable space). A decoder converts this address into a set of activate lines, of which only one is activated according to the value in the MAR. The activated line connects to the memory cells of that address.
• The memory cells on the activated address line are connected to the MDR. The MDR can then either read values from the memory cells or write new values to them.
The following lists the three main buses and signal lines involved in the operation of memory.
• There are usually 32, 64, or 128 address lines, corresponding to an address size of 32, 64, or 128 bits. The number of address lines is exactly the size of the MAR used in the CPU. One of the activate lines is activated according to the value represented on the address lines.
• There is usually an R/W line associated with the MAR/MDR to indicate whether a memory access is a read or a write operation.
• There are also multiple data lines connecting the MDR to the cells of each memory address.

The following figure shows the design of a basic memory system:

The CPU and the MAR/MDR operate in the fetch phase of the fetch-execution cycle in the following manner.
• The content of the Program Counter Register is copied to the MAR, which is the address
storing the next instruction.
• The R/W line is set to read.
• The content of the given address is stored in the MDR (with previous value overwritten).
The content (instruction) is copied to Instruction Register (IR).
The CPU then examines the instruction stored in the IR to determine the actions to follow. If
the instruction is to store the value of an accumulator to a memory location (similar to the
STO instruction in the LMC), then the following happens in the execution phase of the cycle.
• The address part of the instruction is copied from the IR to the MAR, which is the address
where data is to be stored.
• The R/W line is set to write.
• The data in the accumulator is copied to the MDR. The content of the MDR is stored to
the memory cells activated by the MAR and the decoder.
In modern PC computer systems, the MAR and MDR are part of the Memory Management
Unit (MMU) that also performs other memory related functions.
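
The MAR/MDR protocol described above can be sketched in a few lines of Python. This is a simplified model whose class and method names are invented here; it shows a fetch going through the interface.

class MainMemory:
    def __init__(self, size=1024):
        self.cells = [0] * size
        self.mar = 0                    # Memory Address Register
        self.mdr = 0                    # Memory Data Register

    def cycle(self, rw):
        # The decoder activates the line selected by the MAR; the cells on
        # that line are connected to the MDR for reading or writing.
        if rw == 'R':
            self.mdr = self.cells[self.mar]      # read: cells -> MDR
        else:
            self.cells[self.mar] = self.mdr      # write: MDR -> cells

# Fetch phase: PC -> MAR, R/W line set to read, instruction appears in the MDR.
memory = MainMemory()
memory.cells[0] = 901       # an instruction stored at address 0
pc = 0
memory.mar = pc
memory.cycle('R')
ir = memory.mdr             # MDR -> IR
print(ir)                   # 901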

Operation of Main Memory Systems

Operation of the main memory system can occur when the MAR, the MDR, and the R/W line are loaded with data. The loading of data takes time. The operation is therefore synchronized with a memory clock so that the loading of data and the operation take place at the correct timings.
An electronic clock in a computer is a signal that alternates between high and low repeatedly. Memory operations occur according to this signal, usually triggered by an edge (the rising or falling edge) of the clock.

The rising or falling edge is useful because it represents a time instant at which all the data (or signals) involved are ready.
The clock rate (or frequency) has a bearing on the speed of the memory. A slow clock rate means that memory operations occur less frequently, thereby slowing the data movement.
However, we cannot simply increase the clock rate without making other considerations. Memory operations take time to complete, so the clock rate must allow the completion of one operation before triggering the next one.

Semiconductor Memory: Static RAM and Dynamic RAM

The current main memory system is based on semiconductor memory. A standard circuit called a flip-flop can store 1 bit of data. A memory system can be designed with millions of these circuits integrated together.
• Random access memory (RAM) refers to such a memory system, in which the stored data can be accessed in any order.
• RAM based on flip-flops is called static RAM (SRAM). SRAM is fast; it retains its data as long as it is powered, and is volatile when power is removed.
• Each flip-flop is made up of 6 to 8 transistors, which can take up considerable space if a larger memory size is to be packaged.
Packaged RAM chips are available in various standard shapes. The manufacturing process of RAM is similar to that of semiconductor microprocessors.
An alternative technology is called Dynamic RAM (DRAM).
• Dynamic RAM (DRAM) stores data as charge on capacitors, arranged in an array or table of cells. The array of cells provides storage for multiple bits of data.
• The capacitors used in DRAM tend to lose their charge quickly, and therefore require a periodic refresh cycle (every few milliseconds) or data will be lost. A memory subsystem is required to support this refreshing.
• Compared with SRAM, DRAM is less expensive and smaller in size, but it requires more power.

Improving the Throughput

DRAM is significantly slower than SRAM. A processor may have to wait 4 to 6 cycles before DRAM can make the data available.
There are variants of DRAM that are designed to provide better data throughput through some clever designs.

Extended Data Out RAM (EDORAM)


EDO RAM allows two memory operations to overlap in a single cycle. This can double the throughput.

The example above shows that one memory operation requires two cycles (in a real system, more): the first cycle is REQ, and the second cycle is READ.

EDO RAM supports overlapped REQ/READ cycles.
• Each memory operation still requires two cycles, but in each cycle two operations are in progress: the READ of the previous operation and the REQ of the current one.
• EDO RAM uses latches to hold the data of the previous operation. The data is available even after the computer starts specifying the address of the next location to be read; it remains there for reading.
• Single-cycle EDO RAM can carry out a memory access in 1 cycle, where other DRAM needs 2 to 3 cycles.
Burst EDO RAM further improves on EDO RAM, as it allows 4 REQ addresses at the same time. The latches are sufficiently large to hold all the data.
• Four consecutive addresses are requested at once. This exploits the common phenomenon of locality of reference: consecutive addresses in memory are often accessed one after another.
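
A simplified cycle count illustrates the gain from overlapping. The Python sketch below assumes exactly one REQ cycle and one READ cycle per access; real timings involve more cycles, so the numbers are illustrative only.

def cycles_plain_dram(n_accesses):
    # Each access needs a REQ cycle followed by a READ cycle.
    return 2 * n_accesses

def cycles_edo_dram(n_accesses):
    # The READ of one access overlaps the REQ of the next, so after the
    # first REQ every further cycle completes one access.
    return 1 + n_accesses

print(cycles_plain_dram(100))   # 200 cycles
print(cycles_edo_dram(100))     # 101 cycles, nearly double the throughput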

Video RAM (VRAM)
VRAM is designed for video adapters. VRAM systems are dual-ported, allowing simultaneous read and write operations.
• Ordinary RAM is a single-port device: the CPU can perform a read or a write, but not both at the same time.
• With VRAM, the PC can write into the memory to change what will be displayed, while the video adapter continuously reads the memory to refresh the monitor's display. The performance is greatly increased.
Early PCs reserved only 64K of address space for video RAM, not 256K. This might surprise you, as a video card could have 256K of video RAM. To fit this RAM into the 64K space, the RAM is paged: programs can only access a small portion (or page) of the video RAM area at a time. Some newer cards map their entire memory directly into the PC's address space in high memory (above 1024K), creating a video aperture. Only Windows-based operating systems, not DOS, can support such cards.
This technology has been superseded by DRAM technology.

Double Data Rate DRAM (DDR-RAM)
DDR RAM is the current mainstream memory technology.
• The DDR DRAM transfers data in both the beginning and the ending phase of a memory cycle, therefore serving double the amount of data.
• A bus frequency of 100 MHz allows a single-channel DDR RAM to serve 1.6 GB/s.
• PC-1600 64-bit DDR RAM, using DDR-200 chips on a 100MHz bus, has a single-channel output of 1.6GB/s.
o Transfer rate = 100 MHz (memory clock rate) x 2 (for double rate) x 64 (number of bits transferred) / 8 (number of bits per byte)

DDR RAM makes use of both the rising and the falling edge: it is double-pumped.
• DDR-200 (PC-1600): 100MHz = 1.600 GB/s
• DDR-266 (PC-2100): 133MHz = 2.133 GB/s
• DDR-400 (PC-3200): 200MHz = 3.200 GB/s
The DDR2 DRAM series allows the IO clock to run at twice the internal memory clock, fetching double the data on average in one memory clock cycle.
• DDR2-400 (PC2-3200): 100MHz (Memory) = 200MHz (IO) = 3.200 GB/s
• DDR2-800 (PC2-6400): 200MHz (Memory) = 400MHz (IO) = 6.400 GB/s
• Higher latency can make DDR2-400 perform worse than the original DDR.
The DDR3 DRAM series allows the IO clock to run at four times the internal memory clock, fetching quadruple the data in one memory clock cycle.
• DDR3-800 (PC3-6400): 100MHz (Memory) = 400MHz (IO) = 6.400 GB/s
• Even higher latency, resulting in longer access times.

EEPROM and Flash Memory


EEPROM (Electrically Erasable Programmable Read-Only Memory) and flash memory are non-volatile memories that allow rewriting of data. They use a technique called Fowler-Nordheim tunnelling, which traps electrons within an insulator to record the state of a bit. Flash memory is the latest form of EEPROM, allowing faster rewriting of data in whole blocks.

5. Input and Output Devices

IO design is an often-overlooked issue. Many people are more concerned about the CPU speed. However, the performance of a computer system often rests on IO performance: IO devices are significantly slower than the CPU and the memory.
There are many attributes separating one IO device from another. The following lists the
major characteristics:
• Transfer data unit: character-stream or block transfer.
• Relation between IO operations and programs: synchronous or asynchronous data
transfer.
• Data access order: sequential or random access.
• IO device exclusiveness: Sharable or dedicated device.
• Data mutability: read and write allowances.
• IO device latency: fast or slow setup time.
• IO operation speed: high or low data transfer rate.
For example, a keyboard is a character-stream, asynchronous, sequential, dedicated, read-
only, fast setup, and low data transfer rate IO device.
On the other hand, an electro-mechanical hard disk is a block transfer, asynchronous, random
access, sharable, read and write, slow setup time, and high data transfer rate IO device.

IO Operation and Latency

Latency is a major issue in IO operations. The following diagram explains the detailed stages of an IO operation.

Many IO devices are mechanical-electronic devices.
• Mechanical-electronic devices rest to save energy if no action is required.
• Changing from a resting state to a ready state takes time (for example, switching on the motors). This is called the initial setup time, and it can be quite long.
• After the device is ready, data transfer can occur, and the rate of transfer depends on the device.
• The device then completes the data transfer and waits for another request.
• Upon receiving another request, the device still takes some time to begin the data transfer. This is called the latency time.

Synchronous and Asynchronous IO

Two possible methods of handling IO exist.
• Control does not return to the user program until the IO is completed.
• Control returns to the user program after the request is registered. The user program is notified of the completion of the IO with an interrupt.
The first method is known as synchronous IO and the second as asynchronous IO. We need different architectures and services to handle these types of IO.
• Advantage. Asynchronous IO allows the user program to do something else while the IO device is handling the request.
• Disadvantage. With asynchronous IO, it is more difficult for the programmer to write programs that manage exceptional situations, such as an error occurring while the program is doing something else.

Characteristics of synchronous IO include the following.
• A wait instruction idles the CPU until the next interrupt.
• A wait loop (contending for memory access).
• At most one IO request is outstanding at a time; there is no simultaneous IO processing.

Characteristics of asynchronous IO include the following.
• A system call: a request sent to the operating system that allows the user program to wait for IO completion.
• A device-status table contains an entry for each IO device, indicating its type, address, and state.
• The operating system indexes into the device-status table to determine the device status and modifies the table entry to include the interrupt.

Asynchronous IO allows concurrent IO operations on more than one device. A device-status table is needed to book-keep the states; each entry includes the device type, address, and state. The OS also maintains a wait queue for each IO device.

IO Design for Computer Systems

IO devices run at significantly slower speeds than the CPU and the Memory System.
The basic design strategy is to separate them into different worlds of speed, in the same way as the Memory System is separated from the CPU.
The following figure shows how IO devices are connected to the IO controller hub. The hub is then connected to the memory controller hub, before reaching the CPU. The bus speed decreases as the bus moves further away from the CPU.

IO devices are connected to the bus leading out from the IO controller hub.
• Each device controller handles a specific type of device. Sometimes one controller can manage more than one device (as with SCSI).
• Each device controller has a local buffer, which holds data while it is being transferred between the computer system and a device.
• Device controllers operate independently from the CPU.

The CPU instructs a device controller by loading the appropriate registers.
• The device controller examines the registers to see what instruction is there.
• The device controller notifies the CPU of the completion of the instruction by triggering an interrupt.
• The CPU can then move the data between the local buffer and the main memory.

Sending instructions to IO devices

The IO device is now separated from the CPU by at least two controllers in the current programmable computer design. The CPU cannot directly send data or signals to IO devices. While the CPU can directly send data to the Memory System through the MAR/MDR registers, there is no such mechanism built in for IO devices.

Port-mapped IO
Port-mapped IO uses dedicated instructions for IO operations. An example is the LMC instructions IN and OUT. These instructions are handled directly by the CPU, and the CPU sends signals directly to the IO devices to carry out the operation. The IO devices have their own address space for data movement between the CPU and the IO device.

Memory-mapped IO
Another solution to this problem is memory-mapped IO, which unifies access to the Memory System and access to IO devices. Sending signals or data to an IO device is done by writing data to the Memory System: certain areas in the Memory System are declared special places, and any data written to one of these areas is read and handled by an IO device. In the other direction, an IO device sends data back by writing to the same areas.
The following figure illustrates how memory-mapped IO operates.

The CPU transfers data to an IO device by writing the data to the mapped data registers and setting the control register appropriately. The device controller monitors the control register, takes the data, and then clears the control register for the next data transfer.
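
The following Python sketch models this arrangement. The addresses and register names are invented for the illustration; the point is that the CPU uses ordinary store operations, and the device controller reacts to writes in its reserved region.

DATA_REG = 0xFF00        # assumed mapped data register of a device
CONTROL_REG = 0xFF01     # assumed mapped control register

class AddressSpace:
    def __init__(self):
        self.ram = {}
        self.device = {DATA_REG: 0, CONTROL_REG: 0}

    def store(self, address, value):
        if address in self.device:       # a write into the special area
            self.device[address] = value
            # The device controller monitors the control register.
            if address == CONTROL_REG and value == 1:
                print('device received:', self.device[DATA_REG])
                self.device[CONTROL_REG] = 0     # cleared for the next transfer
        else:                            # an ordinary memory write
            self.ram[address] = value

space = AddressSpace()
space.store(DATA_REG, 65)       # write the data to the mapped register
space.store(CONTROL_REG, 1)     # set the control register to signal the device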

Signalling from IO Devices

The CPU has one of the following two options after sending an instruction to an IO device:
• Synchronous IO mode. The CPU waits, polling to check whether the IO operation is still going on.
• Asynchronous IO mode. The CPU forgets about the IO operation for the time being and does something else.
In asynchronous IO mode, the CPU waits for a signal from the IO device when the IO operation is complete or an error has occurred. The signal is known as an interrupt.
An IO interrupt is handled in the following steps.
• After the CPU receives an interrupt signal, the execution of the current program is suspended.
• The program counter (PC) is saved so that execution can return to the suspended place later.
• The controller then refers to the type of the interrupt, and the CPU is made to execute a segment of code according to the interrupt vector.
• The interrupt vector is a table of pointers, usually stored in the lower part of the memory.
• The pointers are the starting addresses of interrupt service routines (ISRs) that are designed to handle a particular type of interrupt.
After the interrupt is handled, the CPU is made to return to executing the program at the address where the interrupt occurred.
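The dispatch mechanism can be sketched as follows. This Python model is illustrative only: the interrupt type numbers and the ISR names are invented, and a real interrupt vector holds starting addresses rather than function objects.

def disk_isr():                 # an interrupt service routine
    print('disk transfer complete')

def keyboard_isr():
    print('key pressed')

# The interrupt vector: one entry per interrupt type.
interrupt_vector = {0: disk_isr, 1: keyboard_isr}

def handle_interrupt(interrupt_type, pc):
    saved_pc = pc                        # save the PC of the suspended program
    interrupt_vector[interrupt_type]()   # execute the ISR selected by the vector
    return saved_pc                      # resume the program where it stopped

pc = handle_interrupt(0, pc=42)          # prints the message, then pc == 42 again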
The following figure illustrates the steps involved in handling an IO interrupt.

6. Input and Output Device Case Study: Hard Disk

Hard disk is arguably the most important IO device of modern computer systems. Hard disks are currently based on magnetic disk technology, which has been serving us since the 1960s despite facing many challenges from other technologies.
Hard disks contribute to modern computer systems in two major ways.
• Provide non-volatile long-term storage for data and files.
• Provide secondary memory to supplement the main memory during the operation of
computers. Data from the main memory can be moved temporarily to hard disk to spare
some space for other data.

Structure of Magnetic Disks

Magnetic disks are physically composed of platters of solid disks stacked on a rotating spindle.

Platters are usually made of metal or glass, with magnetic material deposited on both sides, so one platter has two surfaces for data storage. Each surface is divided into a number of concentric tracks (the tracks at the same position across the platters form a cylinder). Typically there are tens of thousands of tracks on a platter.
Each track is further divided into sectors. A sector is the smallest unit involved in a read/write. Because the outer tracks are longer, usually more sectors are designated there. This scheme is called constant bit density.
To perform a read/write operation, a moving arm with a read/write head is moved over the track containing the desired sector. This is called a seek operation, and the time required to move the arm is called the seek time.
After the read/write head has moved to the desired track, it may not be over the desired sector. The time for the desired sector to rotate under the head is called the rotational delay or rotational latency: the head must wait for the rotation of the platter. If the platter is not already spinning, further delay must be taken into consideration.

Read/Write Performance
The time taken to read/write data on magnetic hard disks must take the following overhead
into consideration.
• Seek time. The time taken for the read/write arm to move over the desired track.
• Rotational Delay. The time taken for the desired sector to be rotated under the read/write
head.
• Controller time. The time taken for the IO controller to process an IO request.
• Queuing time or queuing delay. A hard disk can serve one request at a time; other requests must wait in a queue for service by the hard disk.
A typical performance specification of a modern magnetic hard disk is shown in the
following.
Fujitsu Hard Disk MHV2160BT
Model MHV2160BT
Storage capacity(formatted) 160.0 GB
Bytes/Sector 512
Seek time Track to track 1.5 ms typ.
Average 12 ms typ.(Read), 14 ms typ.(Write)
Maximum 22 ms typ.
Rotational speed 4,200 RPM
Data transfer to/from host 150 MB/s
Interface SATA
Buffer size 8MB

Data obtained from http://www.fujitsu.com/tw/services/computer/harddisk/MHV2160BT.html

Exercises: Read/Write Time


Question: Calculate the time taken for a read/write operation on the hard disk, given that the size of one sector is 512 bytes and the controller time is 0.1 ms. The disk is free of queued requests and is available.

Answer:
The time taken to transfer 512 bytes can be worked out as the following.
Average disk access is the sum of the following: average seek time, average rotational
latency, transfer time, and controller time.
Average seek time = 12 ms
Average rotational latency = 50% × (1 / 4200 RPM) × 60 = 7.1 ms
Transfer time = 512 bytes / 150 MB/s = 0.003 ms
Controller overhead = 0.1 ms
Overall average disk access = 12ms + 7.1ms + 0.003ms + 0.1ms = 19.2ms
Note that the data transfer time (0.003 ms) contributes only a small percentage of the average disk access time. The mechanical factors (the average seek time and the rotational latency) are predominant.
To speed up disk access, we should first focus on the seek time and the rotational latency, which together account for almost all of the overall disk access time.

To minimize seek time, one can read more data than requested, hoping that subsequent requests happen to use the data read ahead. The success of read-ahead rests on the observation that requests have the property of spatial locality: when data is stored on hard disks, it is arranged in sequence on neighbouring sectors and tracks.
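
The access-time calculation generalizes to any disk whose specification lists these figures. The following Python sketch reproduces the exercise; the function name is invented.

def disk_access_time_ms(seek_ms, rpm, transfer_bytes, rate_mb_per_s, controller_ms):
    rotational_ms = 0.5 * (60 / rpm) * 1000           # half a revolution on average
    transfer_ms = transfer_bytes / (rate_mb_per_s * 1e6) * 1000
    return seek_ms + rotational_ms + transfer_ms + controller_ms

# The figures from the exercise above: 12 ms seek, 4200 RPM, 512-byte sector,
# 150 MB/s transfer, 0.1 ms controller time.
print(round(disk_access_time_ms(12, 4200, 512, 150, 0.1), 1))   # 19.2 ms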

7. Integration: PC Motherboard Design

The motherboard is a printed circuit board where the major computer components are
integrated together. In addition to the circuitry connecting the components, it also provides
power, cables, connectors, and physical housing.
A motherboard is designed around the features of the main processor. Apart from the main
processor, it contains a chip set for providing other functions such as memory and IO control.

Classic PC Motherboard Design


The following shows the classic PC Motherboard design for Pentium series processor.

There are two chips that work with the CPU: one is called memory controller hub (north-
bridge) and the other IO controller hub (south-bridge).
• The north-bridge connects the CPU and the following components: main memory, AGP
bus (video), and the south-bridge. The north-bridge determines the performance of the
data transfer between the memory and the CPU (often the deciding factor in system
performance), and the type and amount of memory that can be used.
• The south-bridge is detached from the CPU and is responsible for handling slower communications. The separation of the two allows the critical high-speed transfer between the CPU and memory to happen without interference from the slower communication with peripheral devices, which is the main role of the south-bridge.
• The south-bridge includes an interrupt controller that allows peripheral devices to alert
the CPU.
• The south-bridge also includes a DMA controller that allows data transfer between IDE
hard disks and the main memory.
• The CPU and the north-bridge are connected by a high-speed bus known as the front-side bus (FSB). The speed of the CPU is determined by the speed of the front-side bus times a multiplier. Example Intel technologies for the FSB are GTL+ and AGTL+. The CPU connects to the L2 Cache Memory through the back-side bus.
The FSB is often regarded as the bottleneck of performance of this classic PC design: the all-important memory-to-CPU data transfer running on the FSB is shared with data write-back and the data of IO operations.
Motherboard for Core Duo Processor
The following diagram is adapted from Intel information, showing the specific components
and the bus data rates of a motherboard design for Intel Core Duo processor.

[Figure: Core 2 Duo motherboard layout. The Core 2 Duo processor connects through the FSB (10.6 GB/s) to the P45 Memory Controller Hub, which serves DDR2 or DDR3 memory (6.4 GB/s to 8.5 GB/s) and 16 lanes of PCI Express 2.0 Graphics (16 GB/s). A DMI link (2 GB/s) leads to the ICH10R I/O Controller Hub, which serves SATA x 6 (3 Gb/s each), USB 2.0 x 6 (480 Mb/s), PCI Express x 1 (500 Mb/s), audio, and the Internet LAN.]

Motherboard for Core i7 Processor


Core i7 is the current top-of-the-range processor offered by Intel. The following diagram is adapted from Intel information, showing that the FSB is now replaced with the QuickPath Interconnect (QPI).
The memory controller is within the Core i7 processor, allowing direct access to DDR3 DRAM.
[Figure: Core i7 motherboard design, adapted from Intel information. The Core i7 processor
integrates the memory controller and connects directly to two DDR3 channels (8.5 GB/s
each). The processor connects over QPI (25.6 GB/s) to the IO Hub (IOH, X48), which serves
PCI Express 2.0 graphics (32 lanes, 16 GB/s) and connects over DMI (2 GB/s) to the I/O
Controller Hub (ICH 10R), which serves SATA x 6 (3 Gb/s each), USB 2.0 x 6 (480 Mb/s),
PCI Express x 1 (500 Mb/s), audio, and the Internet MAC/LAN.]


Chapter 7. Case Studies on Features Improving Performance

This chapter will discuss a number of case studies, each of which examines a particular
feature used to improve computer system performance.
• Increasing Clock Rate
• Adding a CPU Cache
• Adding General Purpose Registers
• Adding an Additional System Bus
• Direct Memory Access (DMA)
The features for improving computer performance can be summarized into three
approaches:
• Performing a task faster.
• Performing tasks in parallel.
• Avoiding certain time-consuming tasks.

1. Increasing Clock Rate

Increasing the clock rate is a simple method to improve execution performance. An
increased clock rate allows more actions to take place in the same period of time. For the
same number of RTL steps, doubling the clock rate theoretically halves the time taken.
There are several limitations concerning increasing clock rate to improve performance.
• All components have operating ranges, and the clock rate is one such constraint. For
example, the Memory System might need at least 5 ns to complete a read/write operation,
and so the clock cycle cannot be shorter than 5 ns (i.e. the clock rate cannot be greater
than 1/5 ns, or 200 MHz).
• Components running at high speed generate heat. An excessively high clock rate may
generate heat beyond the capability of the designated cooling device.
• Components running beyond their designated speed have a shortened lifespan.
[Figure: Clock generation. A 133 MHz system clock drives the memory and the FSB. The
processor multiplies the system clock by the CPU multiplier (x 10 here) to obtain its internal
frequency of 1.33 GHz.]

Over-clocking is carried out by enthusiasts to boost their computer performance. The above
diagram shows that there is a system clock driving the clock signals for other components.
Modern processors have a frequency multiplier that multiplies the system clock for the
internal processor clock. The ratio between the system clock and the internal processor clock
is the CPU multiplier.
• Increasing the system clock drives up both the processor and the memory system.
• Increasing the CPU multiplier can achieve over-clocking in the processor only.
• To improve component stability under over-clocking, increasing the operating voltage of
the component may help.
• The increased clock rate will cause more heat generation. Additional cooling should be
installed to help dissipate the heat.
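
The relationship between the system clock, the multiplier, and the processor frequency can
be expressed as a short calculation. The sketch below uses the figures from the diagram
above; the two over-clocking variations correspond to the two options just listed:

#include <stdio.h>

int main(void) {
    double system_clock_mhz = 133.0; /* base system clock (MHz) */
    double multiplier = 10.0;        /* CPU multiplier          */

    /* Internal processor frequency = system clock x multiplier. */
    printf("CPU frequency = %.2f GHz\n",
           system_clock_mhz * multiplier / 1000.0);             /* 1.33 GHz */

    /* Over-clocking option 1: raise the system clock
       (this also drives up the memory system and FSB). */
    printf("143 MHz x 10 = %.2f GHz\n", 143.0 * multiplier / 1000.0);

    /* Over-clocking option 2: raise the multiplier
       (this speeds up the processor only). */
    printf("133 MHz x 11 = %.2f GHz\n", system_clock_mhz * 11.0 / 1000.0);
    return 0;
}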
Over-clocking can potentially damage computer components. It may also cause errors in
computation due to lower stability of the components. The processor performance gained is
actually not that significant.
This section is a theoretical discussion of over-clocking and it does not teach you how to
over-clock.

2. Adding CPU Cache
The clock rates of the CPU and the main memory are significantly different. More
importantly, the potential throughput is also significantly different. The main memory has
increasingly become a major performance bottleneck: its throughput cannot satisfy the
fast processor’s demand for instructions and data.
CPU cache is a feature that reduces the processor’s dependency on the throughput of the main
memory.

Processor-Memory Performance Gap


In the past, the CPU speed doubled around every 18 months (Moore's Law), but the
memory speed could only double in about ten years. This phenomenon is often called the
Processor-Memory performance gap. Memory speed improvement is hampered by several
factors:
• There are memory technologies that are fast but expensive. Increased demand for memory
size requires most computers to use more economical types of memory as their primary
memory.
• The large memory size requirement complicates address decoding and handling.
• The large memory size requirement also complicates physical packaging and layout of
memory chip on printed circuit boards.
• The large memory size requirement makes it physically impractical to closely integrate
with the CPU.
• The bandwidth between chips is limited.
For the last point, refer to the following figure.
The figure shows a motherboard where the CPU, Memory and other components are
physically laid down. The CPU and the Memory are connected by inter-chip communication
and because of physical limitation the bandwidth cannot match the data movement speed
within the CPU.

The concept of the Memory Wall was postulated by Bill Wulf and Sally McKee in 1994. They
predicted that the divergence of CPU speed and memory speed improvement would soon
cause all computer performance to be dominated by memory speed.

Reading: Memory Wall


http://www.acm.org/crossroads/xrds5-3/pmgap.html

Memory Operations

The very first design of the programmable computer assumed that all the components,
including the CPU and the Memory, operate at the same clock rate.
Look at the fetch part of an LMC instruction in RTL below. The CPU expects the Memory
to have the data ready at the MDR in one clock cycle.

ADD 20

PC > MAR
M[MAR] > MDR ; expects the memory to have the data ready in one clock cycle
MDR > IR

Consider that CPU technology has advanced so that the CPU can operate at a faster clock
rate, while the Memory System's speed has remained the same. With our current design,
the CPU has to operate at the same speed as the Memory, and is therefore unable to exploit
the improvement in CPU performance.

Separating Memory from Processor

A solution to this problem is to separate the two buses and to allow them to run at different
speeds. A bus controller is placed between the two buses to coordinate data
movement between them.

The CPU can now run at a faster clock rate, but there is another problem. The program
instructions are still located in the Main Memory. During the execution of each instruction,
the CPU still has to wait for the slower Memory System to respond. A solution is to prevent,
as far as possible, the CPU from accessing the Memory System.

111
CPU Cache

The CPU memory cache is a very fast memory subsystem located within the CPU. The act of
the CPU cache making a copy of the data in the main memory is called caching. Caching is a
technique that allows the CPU to operate more independently of the Memory System. If the
required data or instruction is already within the CPU, the CPU need not read from the
Memory System.
The following figure describes a scenario of how caching would allow the CPU to operate
independently.

Consider that the computer is executing a small LMC program that involves a loop.
• As the execution begins, the CPU reads from the Memory System each instruction one by
one.
• The CPU needs to wait every time for the slower Memory System to catch up.
• The key operation is that the CPU is saving each instruction in the CPU memory cache.
When the loop is executed the second time, the instructions to be executed are already
available in the CPU memory cache. The CPU can read directly from the CPU cache and
can operate with fewer accesses to the Memory System.

Performance of CPU Cache

The ideal CPU cache is one that always stores the required data (or instruction). Memory
operation is not needed at all.
The cache hit rate is the percentage of requests that can be satisfied by the cache. The ideal
CPU cache would have a hit rate of 100%.
Theoretically a high hit rate can be achieved by:
• Larger cache size.
• More accurate predictions about the future data requests. Keep such data in the cache.
A drawback of a larger cache size is higher latency due to the search operation. A search
operation in the cache is involved in every memory operation. Although the cached data is
organized efficiently in a data structure, the search time still increases with cache size.
Multi-level cache is designed to provide a better balance between cache size and latency. A
lower level cache (e.g. the L1 cache) is small in size and so its latency is low. The second
level cache is of a larger size, and each subsequently higher level of cache increases in size.
The search begins with the lowest level cache first.
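
The benefit of a cache can be quantified with the standard average-access-time formula (not
stated explicitly in these notes): average time = hit rate x cache access time + miss rate x
memory access time. A minimal C sketch with assumed, illustrative timings:

#include <stdio.h>

int main(void) {
    /* Assumed timings: these figures are illustrative only. */
    double t_cache = 1.0;   /* cache access time (ns)       */
    double t_memory = 60.0; /* main memory access time (ns) */
    double hit_rate = 0.95; /* fraction served by the cache */

    /* Average access time = h * t_cache + (1 - h) * t_memory. */
    double t_avg = hit_rate * t_cache + (1.0 - hit_rate) * t_memory;
    printf("Average access time = %.2f ns\n", t_avg); /* 3.95 ns */
    return 0;
}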

Locality of Reference

The CPU cache manager can often make correct predictions about future requests, even
though it has no crystal ball.
The locality of reference is a phenomenon that the next memory request is more likely to be
at a nearby address than at any other addresses. For example, if the current memory request is
at address 1000, then the chance of requesting address 1001 is higher than any other
addresses.
The locality of reference exists because of some properties of program execution, including:
• Sequential execution model: usually the next instruction to execute is at the next address.
• Local branching: even if a branching occurs, the address to branch would be nearby if
branching is associated with a condition (if-else) or repetition (for-while) structure.
• Array processing: data in an array is usually accessed from one end to the other.
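
The array-processing case can be made concrete with a short loop. In the sketch below,
consecutive iterations touch consecutive addresses, so most accesses hit data already brought
into the cache by an earlier access (disk read-ahead benefits from the same pattern):

#include <stdio.h>

#define N 1000

int main(void) {
    static int data[N]; /* zero-initialized array in memory */
    long sum = 0;

    /* Sequential traversal: the access at index i+1 is at the
       address right after the access at index i, so most accesses
       hit data already brought in by the previous cache line fill. */
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %ld\n", sum);
    return 0;
}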

Coherency in Caching

With caching, a piece of data would appear in several copies in different storage facilities.
Modification to the value in one place would render the other places incorrect. This is
actually acceptable provided that there is only one process accessing the data.
In a multi-programming environment, this is a potential problem because another
program could access the other copies of the same data. This happens when the CPU switches
from one process to another. The situation is more complicated in a multi-processor
environment, and in a distributed environment. There are many methods to guarantee
consistency, but each of them requires effort to manage the data copies.

3. Adding General Purpose Registers

Registers are at the top of the memory hierarchy and they are very fast memory.

Types of Registers

General purpose registers are also called user-visible registers. They can be directly
manipulated with instructions, such as storing/loading data for such registers.
The LMC has only one general purpose register, the accumulator (ACC). It can be
manipulated with LDA, STO, ADD, SUB, IN and OUT instructions.
Specific purpose registers are used to support the operation of the CPU in the fetch and
execution cycle. Examples of specific purpose registers include the following:
• Program Counter Register (PC) for storing the program counter.
• Instruction Register (IR) for temporarily storing the instruction loaded from the memory.
• Memory Address Register (MAR) for holding the address of a memory location where
data may be loaded or stored.

• Memory Data Register (MDR) for holding the data involved in a load/save operation with
a memory location.
• Status Register for holding the various statuses arising during the operation of the CPU,
such as arithmetic conditions (e.g. overflow or carry), low power, etc. A status is indicated
with a flag, often 1 bit wide. For example, the 8086 and 8088 chips have a 16-bit status
register storing the following flags: carry, parity, auxiliary carry, zero, sign, trap,
interrupt enable, direction, overflow, IO protection, nested task, resume, and virtual 8086
mode.
• IO Registers for holding the data and identity of the IO device. This is not often used in
modern architectures.
• There are other special purpose registers such as Constant Registers for holding special
value such as zero and one.

Role of General Purpose Registers

General purpose registers located in the processor are useful for storing intermediate and
temporary data. Most processes and operations involve a lot of steps. Each step would
consume data from the previous steps and generate data for the next steps. A programmer can
write instructions to store these data in general purpose registers.
LMC has only one general purpose register. A lot of memory operations are found in LMC
programs because intermediate data can only be stored in the main memory. There is no
spare register for storing these in the processor.
Memory operations are slow and the performance would be significantly improved if memory
operations related to intermediate data could be avoided.
The following LMC program shows such an example.

Example: LMC program


This program reads two integers and prints the larger integer.
The instructions STO, SUB, and LDA involve loading or storing data from/to the Memory
System. It is because the CPU has only the ACC and no other place to store intermediate
data.
00 IN     ; read the first integer into ACC
01 STO 11 ; save the first integer at address 11
02 IN     ; read the second integer into ACC
03 STO 12 ; save the second integer at address 12
04 SUB 11 ; ACC = second - first
05 BRP 08 ; if second >= first, branch to 08
06 LDA 11 ; the first integer is larger; load it
07 BR 09
08 LDA 12 ; the second integer is larger; load it
09 OUT    ; print the larger integer
10 COB
11 DAT    ; storage for the first integer
12 DAT    ; storage for the second integer

The data movement between CPU and Memory System can be reduced with more general-
purpose registers in the CPU.
Adding General Purpose Registers

Three new general-purpose registers, named R1 to R3, are added to the LMC to improve
efficiency. The old accumulator ACC is renamed to R0. These three new registers are
connected to the CPU system bus. The following figure shows the revised design.

The new general-purpose registers are not usable without new instructions for
manipulating them. The following describes two new LMC instructions:
• MOV RA, RB: copy data from Register B (RB) to Register A (RA).
• SUB RA: subtract RA from R0 and store the result to R0.
The following shows the RTL steps for the two new instructions.
MOV RA, RB

PC > MAR
M[MAR] > MDR
MDR > IR
PC + 1 > PC
R[B] > R[A]

SUB RA

PC > MAR
M[MAR] > MDR
MDR > IR
PC + 1 > PC
R0 – RA > R0

The two new instructions require 5 RTL steps. If each step takes one clock cycle, the
instruction takes 5 clock cycles.
• They take two fewer RTL steps than the original LMC SUB instruction, which
takes 7 RTL steps.
• Memory operations may take more than one clock cycle, so the two new
instructions are comparatively even faster because they carry out fewer memory operations.
With the new instructions MOV and SUB, the LMC program is rewritten as the following to
exploit the new general purpose registers.

Example: Revised LMC program

This revised program keeps the two input integers in the registers R1 and R2, so no
memory locations are needed for intermediate data.

00 IN          ; read #1 into R0
01 MOV R1, R0  ; R1 = R0
02 IN          ; read #2 into R0
03 MOV R2, R0  ; R2 = R0
04 SUB R1      ; R0 = R0 – R1
05 BRP 08      ; if R0 >= 0
06 MOV R0, R1  ; R0 = R1 (R1 stores #1)
07 BR 09
08 MOV R0, R2  ; R0 = R2 (R2 stores #2)
09 OUT
10 COB

The revised program should perform better. The program is shorter and some instructions
also take shorter time to execute.

Example: Performance Evaluation

Question: Compare the execution of the two programs and evaluate the performance gain.

Answer:
A number of quantitative measurements can be used to compare the two programs. Two of
them will be used here: RTL steps (similar to clock cycles) and memory operations.
Normally, all instructions that would have been executed by the programs are taken into
consideration. The programs have no loop, which makes this easier.

The program has a conditional branch. For simplicity, only the case of the first integer
being greater than the second integer is considered.

                         Original Program                Revised Program

Number of instructions   11                              11

Number of RTL steps      5+7+5+7+7+4+7+4+7+5+3           5+5+5+5+5+4+5+4+5+5+3
                         = 61 RTL steps                  = 51 RTL steps

Number of memory         1+2+1+2+2+1+2+1+2+1+1           1+1+1+1+1+1+1+1+1+1+1
operations               = 16 memory operations          = 11 memory operations
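
The relative improvement can be worked out directly from these counts. A minimal C sketch
of the arithmetic, using the figures from the table above:

#include <stdio.h>

int main(void) {
    /* Counts taken from the evaluation table above. */
    int rtl_original = 61, rtl_revised = 51;
    int mem_original = 16, mem_revised = 11;

    printf("RTL-step speedup = %.2f\n",
           (double)rtl_original / rtl_revised);                 /* about 1.20 */
    printf("Memory operations saved = %.0f%%\n",
           100.0 * (mem_original - mem_revised) / mem_original); /* about 31%% */
    return 0;
}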

4. Parallel Execution and Adding another System Bus

Performing tasks in parallel can certainly shorten the time to complete. However, parallel
processing must satisfy the following requirements:
• The tasks are independent. For example, if task B depends on the result of task A, then A
and B cannot be performed together.
• Additional resources are available for the parallel execution of the tasks.

This case study investigates the effect of parallel execution of RTL operations.
Consider that we have an instruction ADD R1, R0, R2 that adds R1 to R0 and the result is
stored in R2. This is a register-based instruction. In the execution phase, all data movement
happens on the system bus inside the CPU. The system bus design is shown in the following.

The RTL for the instruction is given below. There are 6 steps and it takes 6 clock cycles to
complete the execution (assume 1 clock cycle per RTL step).

ADD R1, R0, R2

PC > MAR
M[MAR] > MDR
MDR > IR
PC + 1 > PC
R[0] + R[1] > R[0]
R[0] > R[2]

If the 6 RTL steps could be executed in parallel, then it would just take 1 clock cycle to
complete the execution. There are a few reasons why this is not possible.
• An RTL operation may depend on the result of a previous RTL operation. For example,
the second RTL operation requires the loading of the MAR by the first RTL operation.
These two operations cannot happen together.
• Hardware design restricts the possible parallel operation. The last two RTL operations
require the system bus and so they cannot happen together.

The multi-point bus can only support one pair of components communicating at a time. Let
us examine which of the steps use the system bus. Steps #1, #3, #5, and #6 happen on the
system bus.

RTL step Location


PC > MAR System Bus
M[MAR] > MDR Memory
MDR > IR System bus
PC + 1 > PC Control Unit Signal
R[0] + R[1] > R[0] System Bus
R[0] > R[2] System Bus

Since step #2 and step #4 do not use the system bus, consider whether it is
possible for them to happen in parallel with any other steps.
• Step #2: Not possible because step #2 cannot occur before completion of step #1. Also,
step #3 cannot occur before completion of step #2. There is data dependency between the
first 3 steps.
• Step #4: The increment of PC is caused by a signal from the Control Unit. This step can
happen in parallel with step #5.

The following shows the timing information of the execution of the instruction. The
instruction now takes one less clock cycle to complete.

Adding one more system bus can increase the opportunity for parallel execution. The
following figure shows a possible design based on two system buses.

The output port of the ALU should be connected to both system buses to facilitate the
movement of data. However, the two system buses cannot be connected because they should
be carrying different data. A pair of control gates is placed at the output port of the ALU to
control which system bus is connected to the output port.

With the additional system bus, the last two steps in the instruction ADD R1, R0, R2 (in
below diagram) can happen in parallel.

RTL step Location


PC > MAR System Bus #0
M[MAR] > MDR Memory
MDR > IR System bus #0
PC + 1 > PC Control Unit Signal
R[0] + R[1] > R[2] System Bus #0 and System Bus #1

Adding an additional system bus provides opportunities for parallel execution of the steps in
the fetch and execution cycle. There is a cost implication of adding a bus, however, and the
designer must weigh the costs against the benefits.

5. Direct Memory Access (DMA)

Direct memory access is a technique that allows IO-to-Memory operation to occur in parallel
with processor execution of instructions.
IO-to-Memory operations occur quite frequently in modern computers:
• Loading programs from hard-disk to the main memory before execution.
• Loading data for program processing.
In the current computer design, the CPU needs to take care of IO operations through handling
interrupts, even in asynchronous IO operation mode.
• CPU executes an instruction to initiate an IO operation.
• CPU continues to execute other instructions, and leaving the IO operation to run in
parallel.
• IO operation completes and raises an interrupt.
• CPU suspends the current execution and handles the interrupt.
• After the interrupt is handled, the CPU resumes the suspended execution of instructions.

For a busy high-speed device handling many requests, there will be too many interrupts. Each
interrupt hampers the smooth operation of the CPU, and the CPU is forced to do a lot of IO
handling instead of executing programs.
One solution to free the CPU from IO activities is to allow IO devices to transfer data to and
from the main memory independently of the CPU.
• High-speed devices use a method called direct memory access (DMA), in which the device
controller transfers a whole block of data directly between the main memory and the
device's local buffer.
• Only one interrupt is generated per block.
• A DMA controller is instructed by the device driver (in the OS) with the address of a buffer
(in the main memory) and the length of data to copy. The CPU can do other things
independently in the meantime.
• The major limitation is that the Memory System can serve only one request at a time. DMA
still competes with the CPU for memory system access.
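
The driver-side programming model can be sketched as writes to a few controller registers.
The register names and addresses below (DMA_ADDR, DMA_LEN, DMA_CTRL) are
entirely hypothetical, invented for illustration; real controllers differ, but the pattern of
writing a buffer address and a length, starting the transfer, and waiting for a single
completion interrupt is the same:

#include <stdint.h>

/* Hypothetical memory-mapped DMA controller registers; the names
   and addresses are invented for illustration only. */
#define DMA_ADDR ((volatile uint32_t *)0xFF000000) /* buffer address  */
#define DMA_LEN  ((volatile uint32_t *)0xFF000004) /* length in bytes */
#define DMA_CTRL ((volatile uint32_t *)0xFF000008) /* control/start   */
#define DMA_START 0x1u

static uint8_t buffer[4096];

/* Ask the controller to copy one block from the device into memory.
   The CPU returns immediately; a single interrupt signals completion. */
void dma_read_block(void) {
    *DMA_ADDR = (uint32_t)(uintptr_t)buffer; /* where to put the data */
    *DMA_LEN  = sizeof buffer;               /* how much to copy      */
    *DMA_CTRL = DMA_START;                   /* start the transfer    */
    /* ... the CPU executes other instructions while the DMA runs ... */
}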
The following figure shows the operation of DMA.


Chapter 8. Instruction Set Architecture and Addressing Modes

Instruction set architecture concerns the programming aspect of computer architecture or
computer design. A computer designer can include many good features that allow a computer
system to run efficiently. However, the programmers must be able to make use of these
features through programming.
The instruction set, through the skills of programmers, determines how efficient programs
can be.
This chapter discusses instruction set architecture through Extended LMC (e-LMC). E-LMC
provides an instruction set for illustration of various concepts related to instruction set design.

1. Overview

Instruction set architecture (ISA) is related to the following issues:


• Function or type of instructions. Common types include:
o Arithmetic: add, subtract, negation, etc
o Condition and logic: conditional and relational operators, etc
o Branch: changing the address of next instruction to execute
o Data Load/Store: load/store a general purpose register or a memory address
o Input/Output
• Operands of instructions. Many instruction types would require specification of one or
more operands.
o Arithmetic operators typically require three operands: two operands for the
operation (the A and B in A + B) and one operand for specifying the
destination of the result.
o Operands may be implicit. Implicit operands are not included in instructions.
For example, the LMC ADD instruction has one implicit operand, which is
the ACC. Only the memory address operand is included in the instruction.
For example, ADD 20 will add data in memory address 20 to ACC and the
result is stored in ACC.
• Length of instructions. The length of instruction is the memory size occupied by the
instruction.
o All LMC instructions are of the same length, occupying one memory
location. The LMC instruction set is known as fixed-length instruction set.
o An instruction set that contains instructions of varying lengths is called
variable-length instruction set.
o Longer instructions generally occupy more main memory and take longer to
read into the processor.

• Format of instructions. Generally, instructions and data share the same representation in a
von Neumann computer. The LMC instructions are in the form of 3 decimal digits.
o A typical instruction has two items to be fit into the representation: operation
code (op-code) and operands.
o For example, the first digit of LMC instructions is the op-code. However, if
the op-code is 9, then all three digits form the op-code. The format is said to
be variable.
The main purpose of instructions is, no doubt, processing data. An important consideration in
instruction set architecture is to determine how a processor stores data.
There are generally three types of architecture concerning data storage in a processor:
• Stack architecture. Data and instructions are stored in a stack in the processor.
• Accumulator architecture. Data is mainly stored in the accumulator (ACC).
• General-purpose register architecture. Data is stored in one or more general-purpose
registers in the processor.

2. LMC Instruction Set Review

The following lists the LMC instruction set.

Instruction Code Remarks


Load 5XX Load from mailbox to calculator
Store 3XX Store in mailbox from calculator
Add 1XX Add from mailbox to calculator
Subtract 2XX Subtract from calculator the mailbox value.
Input 901 Input
Output 902 Output
Halt 000 Coffee Break or Halt
Branch 6XX Branch unconditionally
Branch if Zero 7XX Branch if zero
Branch if Positive or Zero 8XX Branch if positive or zero
Data A location for data storage

LMC instruction set architecture can be summarized as follows:


• Functions and types: there are 2 Load/Store, 2 Arithmetic, 2 Input/Output, 3 Branch, and
1 Misc operators.
• If the operand is the accumulator (ACC), it is implicit. Memory address operands are
explicit in the Load/Store/Add/Subtract/Branch instructions.
• LMC instruction set is fixed length. Every instruction occupies one memory location.
• The LMC instruction set follows a format with small variation. The first of the three digits
of an instruction is the op-code. If the first digit is not ‘9’ or ‘0’, then the next two digits
represent a memory address operand.
• LMC instruction set is based on an accumulator architecture.

Limitations of LMC Instruction Set

The following highlights several limitations of the LMC instruction set.


• LMC has limited arithmetic and logic instructions. Common operations such as
multiplication and division are not supported. Even if possible, these must still be
implemented with many carefully designed instructions instead of a single instruction.
• LMC has only one general purpose register (i.e. the ACC) and provides no efficient register-
to-register instructions.
• LMC has limited data movement instructions.
• LMC has no address manipulation instructions. Array and pointer operations require
manipulation of addresses. High-level programming languages cannot be easily
translated into LMC instructions.
Some major consequences of the above limitations are listed in the following:
• Code generation challenges. The compilers or the programmers must be capable of
generating program segments for various operations. This increases demand on the
quality of code generation.
• Memory operations. The increased number of instructions to execute increases the
frequency of memory operations, including the loading of instructions. This will increase
the time of execution of a program.
The following shows two examples of the difficulties in code generation with LMC
instruction set.

Example: Array Traversals

LMC is unsuitable for supporting some high-level language constructs, such as array traversal
and pointers. The following is a C program segment.

int array[10];
int i = 0;

for (; i < 10; i++)
    array[i] = 0;

If there were a C compiler for LMC, the compiler would be bound by the limitations of the
LMC instruction set. A possible LMC code would be generated as follows. The code
generation process would place the array at addresses 80 to 89 and the variable i at
address 90.

00 LDA 90 ; load variable i
01 SUB 97 ; sub constant 10
02 BRP 12 ; break the loop if i >= 10
03 LDA 99 ; load constant 0
04 STO 80 ; store in array. STO 80 code is 380
05 LDA 04 ; load the instruction at address 04
06 ADD 98 ; add constant 1 to the instruction
07 STO 04 ; store the modified instruction back
08 LDA 90 ; load variable i
09 ADD 98 ; add constant 1 to i
10 STO 90 ; store i back
11 BR 00  ; return to beginning of loop
12 ...    ; outside the loop
90 DAT 00 ; variable i
97 DAT 10 ; constant 10
98 DAT 01 ; constant 1
99 DAT 00 ; constant 0

Some key points about the LMC program:


• The above code relies on changing the instruction at address 04 dynamically. The first
loop iteration stores 0 to address 80; the next loop iteration stores to 81. This is done by
the instructions at 05 to 07, which add one to the instruction at address 04 per iteration.
• A lot of memory operations would happen. The simple operation of adding one to a
memory location involves 3 instructions and 6 memory operations.

Example: Pointers

The following shows an example of pointer manipulation in C

int* ptr;
int i = 0;

ptr = &i;
*ptr = 10;

The following shows an equivalent LMC program. The variable ptr is allocated at address
90, and variable i at 91.

00 LDA 99 ; load the constant 91, which is the address of i
01 STO 90 ; store it into ptr
02 LDA 90 ; load the content of ptr
03 ADD 06 ; add 300 (the content of address 06), forming the code 391 (STO 91)
04 STO 06 ; store the constructed instruction at address 06
05 LDA 10 ; load the constant 10
06 STO 00 ; placeholder with code 300; overwritten at runtime with STO 91
07 COB    ; finish
10 DAT 10 ; constant 10
90 DAT 00 ; variable ptr
91 DAT 00 ; variable i
99 DAT 91 ; constant 91

Some key points about the LMC program:


• Dynamic instruction modification is employed again so that the constant 10 is stored to an
address determined at runtime. The address is determined by the operand of
instruction 06, which is calculated by instructions 02 to 04.
• An assignment operation in C is compiled into several instructions in LMC,
resulting in more memory operations.

3. Operands and Instruction Set Architecture

Operands are key to providing programmability in an instruction set architecture.

Implicit and Explicit Operands

In instruction set architecture, implicit operands and explicit operands are different in their
visibility in the instructions:
• Implicit operands are assumed by the specific instructions and they are not part of the
instruction format.
o The IN instruction of LMC has an implicit operand, the accumulator (ACC).
The LMC code for IN is 901, which does not contain an operand in the
instruction format.

• Explicit operands are visible in the instruction format.
o The STO instruction of LMC copies the ACC value to a memory address.
The STO code is 3XX, where XX is the memory address operand. It has one
explicit operand (the memory address) and one implicit operand (the ACC).
Instruction length should be as short as possible to minimize memory usage and memory
operations. Implicit operands do not occupy space in the instruction format, and instruction
set architects would make operands of some instructions implicit for better performance.

Example: Implicit and Explicit Operands

Question: How many implicit and explicit operands are there in the LMC ADD
instruction?
Answer:
The ADD XX instruction carries out an addition operation on the ACC and the value in a
memory address. The result is stored in ACC.
ACC = ACC + MEM[XX]
There is one explicit operand, which is the memory address
There is one implicit operand, which is ACC.

Operand Addressing Modes

Operands in LMC instructions are referring to the desired location where the value can be
loaded or stored. For example, LMC instruction 5 08 is LDA 08, in which the explicit
operand 08 refers to the memory address storing the value to be copied to ACC. The implicit
operand ACC is the location to receive the value.
However, theoretically an operand value can be interpreted in different ways. Given the
operand value 08 above, these are some interpretations:
• 08 is the memory address of the referred location: LDA instruction copies the value in
memory address 08 to ACC.
• 08 is the value: LDA instruction copies 8 to ACC.
• 08 is the ID of a general-register: LDA instruction copies the value in register R8 to ACC.
• 08 is the memory address holding the memory address of the referred location: LDA
instruction copies the value found at the address stored in address 08 to ACC.
The various interpretations are known as the different addressing modes for the operand.

Addressing Modes Remarks


Direct Value representing the memory address of referred location
Immediate Value itself
Register Value representing the ID of the referred general-purpose register.
Indirect Value representing the memory address holding the memory address of referred
location

The addressing mode of an operand of an instruction is part of the definition of the


instruction. It will be discussed again in the next sections.

4. Extended LMC

This section introduces the Extended LMC (E-LMC), with a new instruction set that has
incorporated some new features in the computer. The following summarizes the new features
in the E-LMC:
• Memory addressing space is extended to 1,000. The addresses range from 0 to 999.
• General purpose registers R4 to R7 are added. The accumulator (ACC) is preserved.
• Constant registers R0 to R3 are added.
• Two output devices are supported: (1) a seven-character LCD display based on ASCII
encoding and (2) a 3-digit LCD display based on signed decimal encoding.
• Two input devices are supported: (1) a buffered num-pad for entering a 3 digit decimal,
and (2) a buffered keyboard for entering a character.
• Memory-mapped IO is used instead of port-mapped IO. The IN and OUT instructions are
removed. Memory addresses 990 – 999 are reserved for input/output. Address 990 is
mapped to the 3-digit decimal output. Addresses 991 – 997 are mapped to the 7-character
ASCII-encoded output device. Address 998 is the buffer for the character-based
input, and address 999 is the buffer for the 3-digit decimal input.
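
With memory-mapped IO, an ordinary load or store reaches a device instead of RAM
whenever the address falls in the reserved range. The C sketch below mirrors the E-LMC
mapping above (990 for decimal output, 999 for decimal input); the literal pointer casts are
purely illustrative, as real systems map devices at platform-specific addresses:

#include <stdint.h>

/* E-LMC style memory-mapped IO addresses (illustrative only; real
   systems map devices at platform-specific addresses). */
#define DECIMAL_OUT ((volatile int16_t *)990) /* 3-digit LCD display */
#define DECIMAL_IN  ((volatile int16_t *)999) /* buffered num-pad    */

void echo_once(void) {
    int16_t value = *DECIMAL_IN; /* a load from address 999 reads the num-pad */
    *DECIMAL_OUT = value;        /* a store to address 990 drives the display */
}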

The above changes in the E-LMC require corresponding changes in the instruction set
architecture. New instructions should be added to take advantage of new features such as the
4 general purpose registers and the 4 constant registers.
• Register based instructions are added. They are for manipulation of the data stored in the
registers.
• Arithmetic and logic instructions are added. They are for improving the programmability.
Examples include multiplication and division.
• Memory-to-memory copy instruction is added.
• Instruction length is variable.

E-LMC is not backwardly compatible with LMC: LMC programs cannot run on E-LMC.
Maintaining backward compatibility is often difficult without paying a price in
performance and design extensibility. For example, E-LMC supports 1000 memory addresses,
but the old LDA and STO instructions support the address range 0 to 99 only.

Structural Diagram of E-LMC

The following shows a structural diagram of E-LMC.

The following lists the major changes:


• There are general-purpose registers and constant register connected to the system bus.
• The accumulator (ACC) is now connected to the system bus in duplex and it has two
input ports.
• The input and output controllers are not connected to the system bus. They are now
connected to the memory address lines via an address decoder. The decoder monitors the
address lines (via the MAR) and intervenes if the address is between 990 and 999, which
is the range of the mapped IO registers.

General Purpose Registers and Constant Registers

E-LMC has four general-purpose registers, which should reduce the number of memory
operations.
• The general-purpose registers have ID from 4 to 7.
• The register ID will be used as an operand for instructions involving the general-purpose
registers.
E-LMC also has four constant registers, which are immutable and read-only.
• The constant registers have ID from 0 to 3.
• The register ID will be used as an operand for instructions in the same way as the general-
purpose registers.
The following table shows the constant values stored in the constant registers.

Register Constant Value (Decimal)


R0 0
R1 1
R2 2
R3 999
These constants are designed for the convenience of the programmer. For example, to set
register R4 to zero, copy R0 to R4 with the instruction MOV R4, R0.
E-LMC Instruction Set Architecture
The following table is a summary of E-LMC instructions.

Instruction Addressing Mode Length Opcode Remarks


LDA #DAT Immediate 2 500 DAT ACC < DAT
LDA Addr Direct 2 510 Addr ACC < Mem[Addr]
LDA (Addr) Indirect 2 520 Addr ACC < Mem[Mem[Addr]]
LDA RN+Addr Register Index Relative 2 53R Addr ACC < Mem[RN + Addr]
LDA RN Register 1 54R ACC < RN
LDA (RN) Register Indirect 1 55R ACC < Mem[RN]
LDR RN,#DAT Immediate 2 56R DAT RN < DAT
LDR RN, Addr Direct 2 57R Addr RN < Mem[Addr]
STO Addr Direct 2 310 Addr Mem[Addr] < ACC
STO (Addr) Indirect 2 320 Addr Mem[Mem[Addr]] < ACC
STO RN+Addr Register Index Relative 2 33R Addr Mem[RN+Addr] < ACC
STO RN Register 1 34R RN < ACC
STO (RN) Register Indirect 1 35R Mem[RN] < ACC
MOV RN, RM Register 1 8NM RN < RM
ADD RM Register 1 10M ACC < ACC + RM
SUB RM Register 1 11M ACC < ACC – RM
MUL RM Register 1 12M ACC < ACC * RM
DIV RM Register 1 13M ACC < ACC / RM
BR Addr Direct 2 600 Addr Branch to Addr
BRP Addr Direct 2 610 Addr If (ACC >= 0) Addr
BRZ Addr Direct 2 620 Addr If (ACC == 0) Addr
CPY L, SAddr, DAddr Direct 3 7LL SAddr DAddr Copy data block of length L
from SAddr to destination
DAddr
HLT 1 000 Halt

Some important features about the E-LMC instruction set:


• There are six LDA instructions, in which the data source is specified in different address
modes. The data destination is always the ACC.
o The six LDA instructions are distinctive instructions. Although they share
the same mnemonic LDA, they are considered different.
o There is one explicit operand and one implicit operand (i.e. ACC) for most
addressing modes. For Register Index Relative mode, there are two explicit
operands (RN, Addr) and one implicit operand.
• There are two LDR instructions, which are similar to LDA except the data destination is a
general-purpose register.
o There are two explicit operands, the data source and the data destination.
• There are five STO instructions. The data source is always the ACC, but the data
destinations are specified with five different addressing modes.
• There is one MOV instruction for copying data between general purpose registers.
• There are four arithmetic instructions, and they operate on accumulator and other general-
purpose registers.
o Multiplication and division are included.
• There is a new instruction for copying a data block of a length from a starting address to a
destination address. This instruction is the longest.
E-LMC Instruction Format Issues

Instruction format is the way the op-code and the operands of an instruction are packed
together.
A logically sound and stylistically consistent instruction format design is important:
• A consistent format helps programmers to learn and helps prevent errors.
• A logically sound design facilitates the processing of instructions in the processor and
improves performance.
E-LMC instruction format has the following features:
• Most instructions are two-word long and a few are one-word long.
• The op-code is always in the first word in two-word instructions. However, the op-codes
may occupy the first digit, the first-two digits, or all three digits.
• The first word in an instruction can be used to work out if the instruction is two-word
long.
• For some instructions such as the arithmetic instructions, there is more space in a two-
word format than required. The remaining space is padded (i.e. ignored).
o The last digit of the second word in the arithmetic instructions is padded.
The following shows the format of the E-LMC instructions graphically.

The following shows the different variants of LDA.
• The digit that acts as padding can be filled with anything; the processor ignores it.
• The register addressing modes (register and register indirect) usually allow a short
instruction length. A general-purpose register ID is usually one digit long, so it occupies
only one digit of space in the instruction format.

The following shows the register-based instructions, including the move instruction and some
arithmetic instructions. Again register-based instructions are short.

The CPY instruction is the longest one in E-LMC. It copies a data block of a length
(L) from a source address (SAddr) to a destination address (DAddr). The order of data
copying is from the beginning to the end. It has three explicit operands:
• Length: an integer from 0 to 99.
• Source address: an address from 000 to 999.
• Destination address: an address from 000 to 999.

Exercise: Instruction Format
The following shows an alternative instruction design for the two operand instruction ADD
RN, RM. It performs addition on two general-purpose registers:
RN = RN + RM.
Comment on this design.

Answer:
• The instruction length is 2 instead of 1.
• Padding is applied to word 1 and word 2. The two operands need two digits.
• The opcode must be distinguished from the current instructions. The opcode 19 is
used.

Exercise: Instruction Format


The above example of ADD RN, RM is two-word long. Is it possible to change it to one-
word long?
Answer:
• Yes, it is possible, but the register operands RM and RN must be crammed into word
0. There is only one digit left for opcode.

• Using one digit for the opcode is acceptable, but the opcode cannot be ‘1’ because of the
need to differentiate between the opcodes of instructions. The opcode ‘2’ is used because
no other instruction has an opcode starting with ‘2’.
• It would however reduce the possible opcode available for adding new instructions.

Example Program: Sum 1 to 10


The following shows LMC and E-LMC programs for finding and printing the sum of integers
from 1 to 10.
E-LMC

00 MOV R4, R0  ; R4 = 0
01 LDR R5, #10 ; R5 = 10
03 LDA R5      ; ACC = R5
04 BRZ 14      ; IF ACC == 0
06 LDA R4      ; ACC = R4
07 ADD R5      ; ACC = ACC + R5
08 STO R4      ; R4 = ACC
09 LDA R5      ; ACC = R5
10 SUB R1      ; ACC = ACC – 1
11 STO R5      ; R5 = ACC
12 BR 04
14 LDA R4      ; ACC = R4
15 STO 990     ; output on the 3-digit display
17 HLT

LMC

00 LDA 99
01 SUB 97
02 BRP 10
03 LDA 98
04 ADD 99
05 STO 98
06 LDA 99
07 ADD 96
08 STO 99
09 BR 00
10 LDA 98
11 OUT
12 HLT
96 DAT 1  ; constant 1
97 DAT 11 ; constant 11
98 DAT 0  ; sum
99 DAT 1  ; counter

The following lists the important points about the E-LMC program.
• The register-based instructions have no memory operation in the execution phase. The E-
LMC program uses a lot of these instructions and it should run significantly faster.
• E-LMC provides constant registers. The DAT definitions are not needed here.
• The instructions used in the E-LMC program are mostly of length 1, but a few (LDR,
BRZ, BR, and STO 990) are 2 words long. The last instruction HLT is at address 17
instead of 16 because the preceding STO instruction takes up 2 words.
The following shows the source code of the E-LMC program.

E-LMC Source Code


00 MOV R4, R0 840
01 LDR R5, #10 565 010
03 LDA R5 545
04 BRZ 14 620 014
06 LDA R4 544
07 ADD R5 105
08 STO R4 344
09 LDA R5 545
10 SUB R1 111
11 STO R5 345
12 BR 04 600 004
14 LDA R4 544
15 STO 990 310 990
17 HLT 000

5. Addressing Modes

In summary, E-LMC supports the following kinds of addressing modes:


• Direct addressing mode. The operand specifies an address of data.
o LMC instructions are in direct addressing mode.
• Immediate addressing mode. The operand is the desired data.
• Indirect addressing mode. The operand specifies an address that contains the address of
the desired data.
• Register indirect addressing mode. The operand specifies a register that contains the
address of the desired data.
• Register index relative addressing mode. There are two operands, one of which is a base
address and another operand is a register containing an offset value. The desired data is
found in an address that is the sum of the base address and the offset.
o This is known as relative addressing mode because the address of the desired
data is related to a base address.

Specifying Addressing Modes in Mnemonic Form

The addressing mode of an operand in an instruction is expressed using the following syntax.

Addressing Mode Syntax Examples


Immediate #Data LDA #20
Direct Address LDA 20
Indirect (Address) LDA (20)
Register Index Relative RN + Address LDA R4 + 20
Register RN LDA R4
Register Indirect (RN) LDA (R4)
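
For readers who know C, each addressing mode has a rough high-level analogue. The sketch
below is only an analogy (a C compiler chooses addressing modes itself), but it draws the
same distinctions as the table above:

int mem[1000]; /* stand-in for the main memory */
int r4;        /* stand-in for a register      */
int acc;       /* stand-in for the ACC         */

void analogies(void) {
    acc = 20;           /* Immediate: the operand is the value itself     */
    acc = mem[20];      /* Direct: the operand is the address of the data */
    acc = mem[mem[20]]; /* Indirect: the operand addresses a cell that
                           holds the address of the data                  */
    acc = r4;           /* Register: the operand names a register         */
    acc = mem[r4];      /* Register indirect: the register holds the
                           address of the data                            */
    acc = mem[r4 + 20]; /* Register index relative: register plus offset  */
}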

Case Studies of Instruction Set Architectures

The addressing modes covered in this chapter are the most common ones. In the real
world there are processors designed with many address modes, though most of them
are combinations or variants of the common addressing modes.

Processor Remarks
Intel 8086 17 addressing modes
Pentium 17 addressing modes (backward compatibility)
Itanium 1 addressing mode (register indirect addressing)
MIPS Register addressing mode mainly
Java bytecode Register indirect with offset in a stack architecture

Example: Operations of LDA of Different Addressing Modes
The following gives the content of a range of main memory addresses and some registers.
Address Content Address Content Register Content
20 20 24 1 R4 0
21 9 25 21 R5 23
22 25 26 0 R6 20
23 26 27 25 R7 3

Work out the value loaded into ACC after the execution of each of the following variants of
the LDA instruction.
• Direct addressing LDA 22
• Immediate addressing LDA #22
• Indirect addressing LDA (22)
• Register addressing LDA R5
• Register indirect addressing LDA (R5)
• Register Index Relative Addressing LDA R5+2
Answer:

Direct addressing LDA 22


ACC will contain 25. The operand 22 is an address where the data is found.

Immediate addressing LDA #22


ACC will contain 22. The operand is the data.
Indirect addressing LDA (22)
ACC will contain 21. The operand 22 specifies an address that contains the address of the
desired data. Address 22 contains 25, which is the address containing the data 21.

Register addressing LDA R5


ACC will contain 23. The operand R5 is the register R5. R5 contains 23.

Register indirect addressing LDA (R5)


ACC will contain 26. The operand R5 is the register R5 that contains the address of the
desired data. R5 contains 23, which is the address containing the data 26.

Register Index Relative Addressing LDA R5+2
ACC will contain 21. The operand 2 is the base address and R5 contains the offset. The
resolved address is 23 (from R5) + 2 = 25. This address contains the desired data 21.

Example: Operations of STO of Different Addressing Modes


The following gives the content of a range of main memory addresses and some registers.
Address Content Address Content Register Content
30 31 34 30 R4 5
31 0 35 2 R5 2
32 33 36 37 R6 31
33 35 37 34 R7 1

Work out the values in the address range given above after the execution of the following
instructions based on STO 32. Assume that ACC contains 18.
• Direct addressing STO 32
• Indirect addressing STO (32)
• Register indirect addressing STO (R6)
• Register Index Relative Addressing STO R5+32
Answer:

Direct addressing STO 32


The operand 32 specifies the destination where the data is stored. Address 32 will be
changed to 18 (which is the content of ACC).

Indirect addressing STO (32)


The operand 32 specifies the address that contains the destination of the data. Address 32
contains 33 which is the destination of the data (ACC). Address 33 is changed to 18.

Register indirect addressing STO (R6)


The register R6 specifies the address that contains the destination of the data. R6 contains
31 which is the destination of the data (ACC). Address 31 is changed to 18.

Register Index Relative Addressing STO R5+32


The destination is the sum of address 32 and R5. It is 2 + 32 = 34. Address 34 is changed
to 18.

6. Instruction Execution in E-LMC

E-LMC is no different from LMC in that the execution of instructions follows the fetch and
execution cycle.
• Some instructions are two words long. If a memory system supports a data transfer size of 2
words, should the fetch phase always read in 2 words at a time?
• The operations of the execution phase can vary greatly. Most instructions do not require
a memory operation in the execution phase, so their execution phase is short. A few
instructions use indirect or index relative addressing modes; the execution phase of
these instructions is longer.
The following shows the operations of fetch and execution cycle of various LDA instructions
in E-LMC. It is assumed that the memory fetch is 1 word each time.

LDA #DAT (Immediate Addressing)

PC > MAR
M[MAR] > MDR
MDR > IR (loaded the first word)
PC + 1 > PC
PC > MAR
M[MAR] > MDR
MDR > ACC (2nd word is data)
PC + 1 > PC

LDA Addr (Direct Addressing)

PC > MAR
M[MAR] > MDR
MDR > IR (loaded the first word)
PC + 1 > PC
PC > MAR
M[MAR] > MDR
MDR > MAR (2nd word is address)
M[MAR] > MDR
MDR > ACC
PC + 1 > PC

LDA (Addr) (Indirect Addressing)


PC > MAR
M[MAR] > MDR
MDR > IR (loaded the first word)
PC + 1 > PC
PC > MAR
M[MAR] > MDR
MDR > MAR (2nd word is address ADDR, load its content)
M[MAR] > MDR
MDR > MAR (the content of ADDR is the address of desired data)
M[MAR] > MDR
MDR > ACC (the data is stored)
PC + 1 > PC

LDA (RN) (Indirect Register Addressing)


PC > MAR
M[MAR] > MDR
MDR > IR (loaded the first word)
RN > MAR (copy the value of RN to MAR, the data of the addr is loaded)
M[MAR] > MDR
MDR > ACC
PC + 1 > PC

LDA RN+Addr (Register Index Relative Addressing)
PC > MAR
M[MAR] > MDR
MDR > IR (loaded the first word)
PC + 1 > PC
PC > MAR
M[MAR] > MDR
MDR > IR (2nd word is the base address)
IR + RN > MAR (add the value of RN, the offset, via the control unit)
M[MAR] > MDR
MDR > ACC
PC + 1 > PC

• The requirement to load instructions in two memory operations (the first word and the
second word) increases the number of RTL steps.
• Some RTL steps could happen in parallel, such as the increment of PC, so that the
number of steps is reduced.
• The Instruction Register (IR) can store multiple words.
The following table summarises the differences between variants of LDA in the number of
memory operations.

Instructions # of Memory Operations


LDA #DAT (Immediate Addressing) 2
LDA ADDR (Direct Addressing) 3
LDA (ADDR) (Indirect Addressing) 4
LDA (RN) (Indirect Register Addressing) 2
LDA RN+ADDR (Register Index Relative Addressing) 3

Register-based instructions usually take fewer RTL steps.

MOV RN, RM (Register Addressing)

PC > MAR
M[MAR] > MDR
MDR > IR (loaded the first word)
RM > RN
PC + 1 > PC

ADD RN (Register Addressing)

PC > MAR
M[MAR] > MDR
MDR > IR (loaded the first word)
ACC + RN > ACC
PC + 1 > PC

The E-LMC branch instructions are all in direct addressing mode.

BR Addr (Direct Addressing)

PC > MAR
M[MAR] > MDR
MDR > IR (loaded the first word)
PC + 1 > PC
PC > MAR
M[MAR] > MDR
MDR > IR (loaded the second word)
IR > PC

HLT

PC > MAR
M[MAR] > MDR
MDR > IR (loaded the first word)
Stop the CPU

The following table summarizes the number of memory operations of the register addressing
instructions and the branch instructions.

Instructions # of Memory Operations


MOV RN, RM (Register Addressing) 1
ADD R0, RN (Register Addressing) 1
SUB R1, RN (Register Addressing) 1
BR ADDR (Direct Addressing) 2

7. General Design Issues
Issues to consider in instruction set architecture design:
• The available functions.
• The addressing modes supported.
• The instruction format.
Programmers generally want more functions and therefore more instructions available, but a
larger number of instructions increases the complexity. Computer designers must strike a
balance between performance and programmability:
• Each instruction needs a unique op-code.
• The space designated for op-code determines the maximum number of instructions
possible.
• Allowing longer instructions increases the space to cram op-codes and more operands
into an instruction format, but longer instructions need more memory operations to load.
There are two more common issues to consider in instruction set design.
• Number of explicit operands.
• Fixed instruction length design or variable instruction length design.

Number of explicit operands

A low number of explicit operands keeps the instruction size small.
The nature of the instruction determines the total number of operands.
• Arithmetic operations including addition and subtraction have two operands.
• Negation and branch have one operand.
• Halt has no operand.
However computer designers can make an operand implicit in the instruction and reduce the
size of instruction. For example, E-LMC assumes that one operand in ADD is the ACC.
Some instructions that have many operands must inevitably add size to the instruction
format. A computer designer has to decide whether to include these instructions in the
instruction set.

Fixed Instruction Length Design

In the fixed instruction length design approach, every instruction is of the same length. LMC
is fixed length, while E-LMC is variable length.
Fixed length allows more efficient instruction fetch.
• The fetch phase can read in 2 or 4 words at the same time.
• Some memory systems support fetching multiple addresses in one operation.
• MDR and IR sizes are larger to store more words in one instruction.
• The number of RTL steps is reduced.
The following shows an example of LDA under fixed instruction length design.

LDA ADDR (Direct Addressing)
Fixed Instruction Length (2 words)

PC > MAR
M[MAR] > MDR
MDR > IR (loaded two words)
IR[ADDR] > MAR
M[MAR] > MDR
MDR > ACC
PC + 2 > PC

LDA ADDR (Direct Addressing)
Variable Instruction Length

PC > MAR
M[MAR] > MDR
MDR > IR (loaded the first word)
PC + 1 > PC
PC > MAR
M[MAR] > MDR
MDR > MAR (2nd word is address)
M[MAR] > MDR
MDR > ACC
PC + 1 > PC

However, the fixed instruction length design takes up more memory for storing instructions.
The length of every instruction is the same as that of the longest instruction in the set. For
example, the CPY instruction in E-LMC has length 3, so all other instructions are padded so
that their length is also 3.

8. CISC and RISC Architectures

There are two fundamentally different philosophies in instruction set design for processors.
The Complex Instruction Set Computer (CISC) philosophy is that a processor should
provide a large and rich set of instructions for its programmers and make efficient use of
memory.
• A typical CISC CPU supports as many as two hundred instructions.
• The rich and flexible set of instructions eases the programming task and reduces the
number of instructions required to implement a program.
The philosophy of Reduced Instruction Set Computer (RISC) is that the performance of a
CPU can be greatly enhanced with simplifying the instruction set of the CPU.
• A RISC CPU has a small instruction set and executes their instructions extremely quickly
because the instructions are so simple.
• A typical RISC CPU, such as the SUN SPARC CPU, supports as few as 52 instructions.
In state-of-the-art processor design, the boundary between the CISC and RISC architectures
is becoming more blurred.

Features of the CISC architecture are discussed in the following.


• Programming. The CISC philosophy is that a CPU should be easy to program and make
efficient use of memory. The total number of instructions required to implement a
program is reduced. The instruction sets are designed for the convenience of assembly-
language programmers.
• Instruction Format. Both the length of instructions and the number of CPU clock cycles
required to execute them vary.
• Code Generation. It is easy for compilers to generate efficient code from high-level
languages. The code quality is not very dependent on the performance of a compiler.
• Hardware Implementation. Hardware logic for instruction decoding is complex due to
the fact that a single instruction needs to support multiple addressing modes. Mainly uses
microcode and microprogramming.

Features of the RISC architecture are discussed in the following.


• Programming. Many of the CISC instructions are rarely used, and by reducing the
instruction set, the CPU can be made to perform more efficiently. RISC CPUs only
include hardware support for the simplest and most commonly used instructions.
• Instruction Format. Lengths and formats of instructions are fixed. By making every
instruction identical in size and format, instructions can be fetched and decoded much
more efficiently than the case of variable-length instructions. Pipelining can be more
efficient too.
• Addressing Mode. The addressing modes supported are simple and limited.
• Register Support. There are many general-purpose registers that support the CPU
operations. RISC CPUs provide a large bank of registers for a program to store variables
and intermediate results.
• Code Generation. The code quality depends on the optimization of the compiler.
• Hardware Implementation. Uses instruction pipelining and superscalar processing.
The CISC and RISC philosophies have opposite viewpoints on the roles of hardware and
software in serving the computing needs. CISC considers that hardware should serve
software. RISC, however, considers that software should take the responsibility for making
good use of hardware.

Case Studies of CISC and RISC

The following describes an example from each of the CISC and RISC architecture
approaches.
PowerPC CPU
• A RISC-based processor.
• The instruction set has 224 instructions (divided into 6 categories: integer, floating point,
load/store, branch, processor, and memory control instructions).
• Fixed length instructions (all instructions are 32 bits long).
• Instructions may have zero to five operands.
• Most instructions use register addressing mode; only load/store and branch instructions
use memory addressing.
• There are around 70 registers for program use.
• A pipelined, superscalar architecture, with multiple different execution units, branch
prediction and out-of-order execution.
• A branch history table to improve branch prediction.

Pentium CPU
• A CISC-based processor.
• The instruction set has 336 instructions (28 system, 92 floating point, 52 multimedia
extension, and 164 integer, logical and other general instructions).
• Variable length instructions.
• Instructions support zero to three operands.
• There are 12 different addressing modes.
• There are eight general-purpose registers and eight floating-point registers for program use.
• There are two five-stage pipelines, but the CPU does not use out-of-order processing
techniques.


Chapter 9. Architectures for High Performance Computing

This chapter discusses several architectural concepts that are pertinent to the design of high
performance computer systems.
• Super-scalar Processing and Pipeline Architecture
• Multi-core
• Mainframe Computing
These architectures are designed with performance scalability in mind. In other words, they
have the extensibility and flexibility to handle large-scale data processing tasks. Parallelism
is the basis of these architectures. The capability to carry out actions in parallel can achieve
a greater performance boost than the capability to carry out individual actions faster.
Pipeline architecture is an example of instruction level parallelism. It allows multiple
instructions to be executed at the same time.
Multi-core computer is an example of thread level parallelism. Individual threads can be
executed at the same time by individual cores.

1. Performance Metrics

The main role of computers is to perform tasks for people. The performance of a processor is
commonly expressed as the average number of instructions executed in a second.
• An instruction is the smallest recognizable unit of a task.
• The clock rate of a processor is the number of clock cycles per second. In each clock
cycle, the processor can take one step. The clock rate can indicate the work rate of the
processor.
• Due to different instruction set architectures, different processors take different numbers of
clock cycles to execute an instruction.
o For example, a processor running on a faster clock rate is not necessarily the
better performer. The processor may need many more clock cycles to
complete the execution of one instruction.
This performance measurement is usually expressed in the unit of MIPS (millions of
instructions per second), as processors are typically fast enough to execute over several
million instructions per second.

Exercise: CPU Performance in MIPS


Question: A high-level program is written to run on both Computer A and Computer B.
Computer A takes 300 million instructions to complete the execution of the program. The
time taken is 3 seconds. Computer B takes 100 million instructions to complete the
execution of the same program. The time taken is 5 seconds. (i) Calculate the performance
of Computers A and B in MIPS. (ii) Explain why the same program can behave differently
on these two computers.

Answer:
(i) Computer A takes 3 seconds to execute 300 million instructions, so its rate in millions of
instructions per second (MIPS) is 300 million / 3 seconds = 100 MIPS. Computer B takes
5 seconds to execute 100 million instructions, so its rate is 100 million / 5 seconds =
20 MIPS.
(ii) There are a few reasons: (1) Computers A and B support different instruction sets, so
the same program is compiled into two different sets of machine code. (2) The compilers are
not of the same quality, so one of them might have generated poor and inefficient code.
(3) The clock rates of the two computers are different; one of them may be slower.
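The calculation in part (i) can be checked with a short Python sketch:

# MIPS = instructions executed / (execution time in seconds x 10^6)
def mips(instructions, seconds):
    return instructions / seconds / 1e6

print(mips(300e6, 3))   # Computer A: 100.0 MIPS
print(mips(100e6, 5))   # Computer B: 20.0 MIPS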

Complex and Simple Instructions

The MIPS measurement has some merits for comparing performance of processors, but it
does not take into account the amount of work actually done by an instruction.
• Complex instructions generally take more clock cycles to complete than simple
instructions.
• For example, ADD (addition) and MUL (multiplication) are two instructions of different
levels of effort.
• Without MUL in an instruction set, a program would need a number of ADD and other
instructions to perform multiplication.
• One cannot compare the performance of two ways of doing multiplication without
looking into the detailed performance parameters.

Exercise: Two ways of doing multiplication


Question: Consider two computers A and B with the following performance parameters:

                              Computer A    Computer B
Clock Rate                    1.5 GHz       3.0 GHz
MUL Instruction Clock Cycles  35 cycles     Not provided
To perform a multiplication operation on Computer B, the best a programmer can achieve
is a sequence of 20 instructions, with an average of 2 cycles per instruction. Which
computer can perform a multiplication operation faster?
Answer:
Time to execute one multiplication operation on Computer A:
One MUL instruction.
Time = 35 cycles / (1.5 x 10^9 cycles per second) = 23.3 x 10^-9 seconds ~ 23.3 ns
Time to execute one multiplication operation on Computer B:
20 instructions with an average of 2 cycles per instruction.
Time = (2 cycles/instruction x 20 instructions) / (3.0 x 10^9 cycles per second)
= 13.3 x 10^-9 seconds ~ 13.3 ns
Computer B can execute a multiplication operation faster, but it depends on the skill of the
programmer to write efficient code.
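The same comparison, expressed as a short Python sketch of the time-per-operation formula:

# Time per operation = cycles needed / clock rate (cycles per second)
def op_time_ns(cycles, clock_hz):
    return cycles / clock_hz * 1e9

print(op_time_ns(35, 1.5e9))      # Computer A, one MUL: ~23.3 ns
print(op_time_ns(20 * 2, 3.0e9))  # Computer B, 20 instructions x 2 cycles: ~13.3 ns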

Performance for Enterprise Computing
Enterprise computing refers to the application of computing technologies for large-scale
business applications. Computing solutions for banks, financial institutions, logistics and
government are often based on enterprise computing technologies. These users are more
concerned with the number of tasks completed, and these tasks are business transactions,
processed orders, and requests handled. The performance measurement is therefore the
number of such tasks completed in a second.
The following are some figures obtained from a test of applying IBM Power 750 with 32
POWER7 cores in a bank (Reference: http://www.ameinfo.com/record-breaking-unmatched-
results-ics-banks-305031):
• 30,000 concurrent users and 14,700 financial transactions per second.
• 51,431 transactions per second in ATM and Internet Banking activities.
• 401,606 interest accounts processed per second.

Benchmarking
Benchmarking is the technique that compares the performance of two different computers by
measuring the time that each one takes to complete a set of particular programs. Benchmark
programs are a specially designed set of programs for measurement purposes.
• For a particular benchmark, the same workload is given to a set of computers to test their
performance.
• Benchmarking provides a common standard for comparing performance.
• Benchmarking is especially important for comparing computers of different architectures.
o Computers of same architecture may be compared at the design level:
instructions per cycle, clock rate, etc.
o Computers of different architectures have different instruction sets and are
difficult to compare conceptually.
There are a number of industry standard benchmarks. These benchmark standards have been
scientifically tested so that the test results are consistent and re-producible. Here are some
examples:
• Standard Performance Evaluation Corporation (SPEC)
• Business Applications Performance Corporation (BAPCo)
Benchmarks are usually specific to a particular workload. Here workload means the type of
computer applications. Typical workloads are Business applications and Graphical
applications.
• The type of instructions executed by a Graphical application is different from that by a
Business application.
o Graphical application typically performs more floating point arithmetic (for
2D and 3D coordinate calculation)
o Business application typically performs more integer data movement and
some integer arithmetic.
• A CPU that is efficient on data movement and integer arithmetic will perform better with
business applications. The same CPU will not perform as well with Graphical
applications.
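As a rough illustration of the idea, the following Python sketch times a single arbitrary
stand-in workload; real benchmark suites such as SPEC use carefully designed and validated
program sets instead.

import time

def workload():
    # Arbitrary stand-in for a benchmark program.
    return sum(i * i for i in range(1_000_000))

start = time.perf_counter()
workload()
elapsed = time.perf_counter() - start
print(f"Workload completed in {elapsed:.3f} seconds")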
2. Pipeline Architectures and Instruction Pipelines
Pipeline architectures achieve very high performance by executing multiple instructions in
parallel. The time taken for an individual instruction does not decrease. However, the
overall throughput improves because more instructions are executed per second.
Parallel execution of instructions is difficult to realize. The following shows three
instructions running in a sequence.

If the three instructions were to be executed at the same time, then theoretically the following
would happen.

For the above to happen, the following is required:


• The computer has the mechanism to fetch three instructions from main memory to CPU at
the same time.
• The computer has the mechanism to execute three instructions in the CPU at the same
time.
• The computer has the mechanism to handle multiple memory operations that may occur
due to the instruction execution.
The following diagram explains this situation more clearly by separating instruction execution
into four stages.
• Fetch the instruction from memory into the IR
• Decode the instruction in the IR
• Execute the instruction (may include memory operation)
• Write the result to a register, the accumulator, or the main memory

Basically the above model of instruction execution is similar to the Register Transfer
Language (RTL) perspective. The following shows the RTL for LMC ADD.

RTL Steps of ADD Instruction


PC → MAR
M[MAR] → MDR
MDR → IR
IR[ADDR] → MAR
M[MAR] → MDR
MDR + ACC → ACC
PC + 1 → PC

Instead of expressing in RTL, we are now viewing instructions as consisting of these four
phases: Fetch, Decode, Execute, and Write.
Some instructions take longer in the Execute phase and others take longer in the Write
phase. If an instruction needs one clock cycle to complete each of the four phases, the total
time required would be four clock cycles. The following figure shows the phases executing
in sequence over time.

The following figure shows that when multiple instructions are executed at the same time,
more than one instruction may be in the same stage. For example, in clock cycle #3, all
three instructions are in the Execute stage, so three execution units such as ALUs may be
required.

Instruction Pipelining

An instruction pipeline processes an instruction in several stages. The output of one stage is
passed to the input of the next stage.
The separation of several stages has one major benefit: instructions may be executed at the
same time without the need for more execution units or multiple instances of other
mechanisms.
In instruction pipelining, each instruction is being handled at a different step in the instruction
cycle.
• The CPU can handle several instructions at the same time, but they are all at different stages.
o In the second time cycle below, the CPU is executing the Fetch phase of
instruction #2 and Decode phase of instruction #1.
• Each stage is handled by a dedicated component.
o A Fetch component is handling the Fetch stage of an instruction, and a
Decode component is handling the Decode stage of another instruction.
• The components in an instruction pipeline should operate independently and at the same
time.

Exercise: Performance Gain Due to Instruction Pipeline


Work out the performance gain due to the instruction pipeline.
Answer:
Consider the three instructions in the above diagram.
Sequential execution of the three instructions requires 12 cycles to complete.
In the pipeline, the three instructions require only 6 cycles.
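The count generalizes: with k pipeline stages and n instructions, sequential execution takes
n x k cycles, while a full pipeline takes k + (n - 1) cycles. A minimal Python sketch:

def sequential_cycles(n, stages=4):
    return n * stages

def pipelined_cycles(n, stages=4):
    # The first instruction fills the pipeline; each later instruction
    # finishes one cycle after the previous one.
    return stages + (n - 1)

print(sequential_cycles(3), pipelined_cycles(3))   # 12 vs 6, as in the answer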

Scalar Processing and Super-Scalar Processing
An instruction pipeline can theoretically approach one clock cycle per instruction. If there
is no break in the execution, the continuous overlapping of instruction execution can get
close to scalar processing.

The ability to execute one instruction in a clock cycle is called scalar processing. A CPU of
this class is thus called a scalar processor.
Superscalar processing is a design that employs more than one execution unit within the
CPU so that multiple instructions can be executed simultaneously. A superscalar processor
can execute more than one instruction per clock cycle on average.
For example, a superscalar processor may contain one fetch unit and two execution units.
• The single fetch unit of the processor can fetch several instructions at a time.
• The fetched instructions are saved in the instruction buffer within the processor before
being fed into the execution units.
• The execution units can then perform the execution phase of two instructions in
parallel.

Exercise: Super-Scalar Performance
A CPU has a Fetch component that can fetch 6 instructions in one clock cycle. The CPU has
two sets of Decode, Execute, and Write-Back components. Draw the instruction execution
status in the instruction pipeline and evaluate the performance.
Answer:

The Fetch component maintains the 6 instructions in the buffer until they are all decoded.
Then the Fetch component can fetch the next 6 instructions.
The above pipeline can achieve super-scalar performance in the long run. For example, the
above shows that 10 instructions can be executed in 9 clock cycles.
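The following Python sketch simulates the fetch-and-decode policy described in the answer
(the policy details are taken from the exercise; anything beyond them is an assumption) and
reproduces the 9-cycle result:

# Assumptions from the exercise: 4 stages (Fetch/Decode/Execute/Write),
# a fetch of 6 instructions at a time, 2-wide Decode/Execute/Write
# components, and the next fetch happening only after the batch is decoded.
def superscalar_cycles(n_instructions, batch=6, width=2):
    finish = {}              # instruction index -> cycle its Write completes
    fetch_cycle = 1          # cycle in which the current batch is fetched
    i = 0
    while i < n_instructions:
        size = min(batch, n_instructions - i)
        rounds = -(-size // width)               # ceiling division
        for r in range(rounds):
            decode_cycle = fetch_cycle + 1 + r
            for j in range(i + r * width, i + min((r + 1) * width, size)):
                finish[j] = decode_cycle + 2     # Execute next cycle, then Write
        fetch_cycle += 1 + rounds                # fetch again after full decode
        i += size
    return max(finish.values())

print(superscalar_cycles(10))   # 9 cycles, matching the answer above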

3. Efficiency and Hazards of Instruction Pipelines

A pipeline is most efficient when there is a regular pattern of clock cycles across the various
stages of an instruction. The efficiency drops if one of the stages takes two clock cycles
instead of one.
Pipelines are at their most efficient when all instructions in the same pipeline have the same
pattern of clock cycles in the various stages of execution.
For example, the following figure shows the execution stage of an instruction consuming
more than one clock cycle.

There are general pipelining hazards that will affect the performance of an instruction
pipeline.

Data hazards

Data hazards happen when an instruction depends on the result of a previous instruction that
is still in the pipeline.
• For example, the third step of an instruction needs a result that is stored in a register by
step 4 of the previous instruction.
• A common solution to this kind of hazard is to stall the pipeline by inserting one or more
stalls (wait states), as the sketch below illustrates.
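The following toy Python sketch illustrates the timing, assuming a 4-stage pipeline where
each stage takes one cycle and a register written in the Write stage only becomes visible
from the next cycle onward:

STAGES = ["Fetch", "Decode", "Execute", "Write"]

def stage_cycles(start):
    # Map each stage to the cycle in which it runs.
    return {s: start + k for k, s in enumerate(STAGES)}

i1 = stage_cycles(1)              # instruction 1 starts in cycle 1
i2 = stage_cycles(2)              # instruction 2 starts in cycle 2, no stall
print(i1["Write"], i2["Execute"]) # 4 and 4: i2 executes while the result is still being written

i2_stalled = stage_cycles(3)      # one bubble inserted before instruction 2
print(i2_stalled["Execute"])      # 5: after one stall, Execute sees the new value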

Control hazards

Control hazards happen in the execution of branch instructions.


A branch instruction may invalidate all the instructions in the pipeline at the instant when the
branch is taken.

Here are some solutions to handle control hazards.


• Additional pipelines. Prepare two or more separate pipelines for the possible outcomes.
This solution increases the hardware cost due to the additional pipelines.
• Speculative execution. The branch outcome is predicted based on the history of previous
executions of the instruction (see the sketch below). The work carried out in the pipeline
may become useless if the outcome of the branch instruction differs from the prediction.
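The notes say only that the prediction is based on execution history. One common concrete
scheme, used here as an illustrative assumption, is a 2-bit saturating counter:

class TwoBitPredictor:
    def __init__(self):
        self.state = 2                 # 0,1 predict not-taken; 2,3 predict taken

    def predict(self):
        return self.state >= 2         # True means "predict taken"

    def update(self, taken):
        # Move one step towards the observed outcome, saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:
    print("hit" if p.predict() == outcome else "miss")
    p.update(outcome)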

Structural hazards

Structural hazards mean that the hardware cannot support the running of two instructions at
the same time even if they are in different stages.
• For example, both stages require access to memory but there is only one memory port for
accessing data.
• Like data hazards, a solution to structural hazards is to stall the pipeline by inserting
one or more bubbles into the pipeline.

Instruction reordering can be used to solve some of the hazards.


An instruction may be re-ordered if the following conditions are met:
• The re-ordered instruction is not dependent on the current instructions in the pipeline.
• The re-ordered instruction does not compete for the same component with the current
instructions in the pipeline.
The inter-dependency between instructions and the stages of instructions is the constraint on
whether instruction reordering is permissible. Two instructions (or stages) A and B are inter-
dependent if the result of A is needed by B.
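A toy Python sketch of this inter-dependency test, with illustrative register sets (RAW,
WAR, and WAW name the usual read/write conflict cases):

# Two instructions are safe to reorder only if neither needs the other's
# result and they do not write the same register.
def inter_dependent(a, b):
    raw = bool(a["writes"] & b["reads"])    # B needs A's result
    war = bool(b["writes"] & a["reads"])    # B overwrites what A still reads
    waw = bool(a["writes"] & b["writes"])   # both write the same register
    return raw or war or waw

a = {"reads": {"R1"}, "writes": {"R2"}}
b = {"reads": {"R3"}, "writes": {"R4"}}
print(inter_dependent(a, b))   # False: safe to reorder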

Instruction Set Architecture and Pipeline Efficiency

Generally, an instruction pipeline is at its most efficient if all instructions in the instruction
set follow the same pattern, with the same number of clock cycles spent in each of the Fetch,
Decode, Execute, and Write-Back components.
Instruction pipelines work more effectively with RISC-type instruction sets.
• Most instructions are of the same pattern of execution.
• Many instructions are register based, so they suffer fewer structural hazards due to
memory system bottlenecks.

Modern Superscalar CPU

The processing power of modern CPUs depends partially on instruction pipelines with
multiple execution units.
• Different types of execution units that are tailored to the needs of different types of
instructions.
• A complex steering system that can send instructions to various execution units.
• An algorithm to manage operands and retire instructions in correct program order.
• Able to process instructions out of program order in order to keep superscalar processing
effective.
The following figure, adapted from Englander, page 221, illustrates the major components of
a modern CPU design.

A CPU equipped with multiple parallel execution units allows instructions with the same
pattern of execution to be put in the same pipeline.
The CPU instruction decoder distributes instructions to a number of parallel execution units.
Each execution unit is optimized to perform one type of instruction.

4. CPU Implementation Approaches

There are two fundamental approaches to implementing a CPU.


• Hardwired implementation uses dedicated logic to execute each instruction in the CPU's
instruction set.
• Micro-programmed implementation executes an instruction by first translating it into
several microcode instructions and then executing those microcode instructions. The series
of microcode instructions corresponding to each instruction in the CPU's instruction set is
stored in the CPU's built-in ROM.

Hardwired implementation approach

The hardwired implementation approach designs dedicated hardware logic and circuitry for
each instruction. All this hardware logic and circuitry is then embedded into a single chip.
• Each instruction has its own hardwired logic path to follow when being executed in that
CPU.
• The hardware logic circuits are then combined together to form the control unit of the
CPU.
• The control unit controls the state of the instruction cycle with the help of a timing signal
generator. At the end of each stage in the execution cycle, the control unit issues signals
to tell the timing signal generator to initiate the next stage.
Advantage: This approach is straightforward to implement and works well for simple CPU
architectures.
Disadvantage: This approach is inflexible. Consider what needs to be done when
you want to upgrade the CPU by adding several new instructions and modifying a few of the
existing instructions.

Microprogramming implementation approach

The microprogramming approach is based on the observation that no matter how complex an
instruction is, it can be broken down into a series of fundamental operations within the CPU.
• Data movement: moving data from one register to another.
• Arithmetic and logic functions: performing simple arithmetic or logic functions on data in
registers.
• Conditional branches: making simple decisions based on the values stored in flags and
registers.
Rather than building separate hardware logic for each and every instruction, a number of
simple hardware logic units are built for internal CPU operations, and these internal CPU
operations are then used to form the instructions of the CPU.
These fundamental CPU operations are called microinstructions. The microinstructions
are then programmed to form the actual instructions of the CPU. The tiny programs that form
the CPU instruction set are called microcode. The CPU has built-in read-only memory to
store the microcode.

The following shows an example of executing the ADD instruction. The control unit
executes the micro-instructions according to the sequencing logic in the microcode library.
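As a rough sketch of the idea in Python (the table structure is an assumption for illustration;
the micro-operations follow the RTL steps of ADD shown earlier):

# Sketch: a microcode library mapping each CPU instruction to its
# micro-instructions; the control unit steps through them in order.
MICROCODE = {
    "ADD": ["PC -> MAR", "M[MAR] -> MDR", "MDR -> IR",
            "IR[ADDR] -> MAR", "M[MAR] -> MDR",
            "MDR + ACC -> ACC", "PC + 1 -> PC"],
}

def execute(opcode):
    for micro_op in MICROCODE[opcode]:
        print(micro_op)   # in hardware, each micro-op asserts control signals

execute("ADD")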

The advantages of the micro-programmed implementation of a CPU with an extensive
instruction set include the following.
• A simpler CPU design. The CPU is easier to implement and requires fewer hardware
logic units.
• A more flexible CPU design. Modifying the instruction set is as simple as modifying the
microcode.
However, the micro-programmed implementation has a disadvantage. As each step in the
fetch-execute cycle of a complex instruction is made up of a series of microinstructions, it
takes more clock cycles to complete that instruction.

5. Multi-Core Processors

Multi-core processors are processors that contain two or more processing units (cores) in a
single physical chip package. Each core is an independent processing unit with its own cache
memory to provide instructions and data. One core can execute one program or thread,
and so multiple cores can execute multiple programs or threads at the same time. Multi-core
processors are an example of thread-level parallelism.
The following figure (left) shows the architecture of a standalone computer. The figure
(right) shows two computers networked together. The following lists the characteristics of
this configuration.
• Two networked computers can run two programs at the same time. The throughput is
increased.
• There are two main memory systems. However, data exchange between the two relies on
the network (probably a local area network). The data transfer rate is not fast, limiting the
potential for cooperation.

The following figure (left) shows a typical configuration of a multi-core computer system.
The multi-core processor below has four cores (commonly called quad-core). Each core has
its own cache memory and there is a connection to the single main memory system.
However, the connection is shared between the cores. This is called the shared memory model.

The following lists the characteristics of this configuration:
• There is a single bus connecting the cores and the main memory, allowing a good data
transfer capacity between them.
• Each core is expected to execute individual threads or programs, of which the instructions
should be usually located in the local cache memory.
• When a core needs to load data from the main memory, the performance bottleneck
of the shared memory model comes into play. Only one core can access the
main memory at a time.
Multiple-level cache memory is often used to reduce the chance of accessing the main
memory. Core i7, for example, has three levels of cache. The Level-3 cache is shared
between the cores.

Limitations of Multi-core Architectures

An important limitation of multi-core architectures is that parallelism is a potential and not
necessarily a reality.
• The program or thread running on a core must be assigned to the core.
• The job of assigning programs to cores rests with the operating system.
• There must be sufficient active programs or threads so that the operating system can use
them to occupy all the cores.
• Highest level of parallelism is achieved if the cores are always occupied with instruction
execution.
Programs running on a computer can utilize multiple cores only if they can be split into
independent tasks running at the same time.
• Some problems, such as image processing, can easily be separated into many independent
tasks. Skilled programmers can exploit this potential for parallelism with multi-threaded
programming (see the sketch after this list).
• Other problems are not easily turned into parallel running tasks. Programmers may need
to put in a lot of effort to discover parallelizable tasks.
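A minimal Python sketch of the image-processing case, using a process pool so that
independent row tasks can run on separate cores; the brighten operation is an arbitrary
example:

from multiprocessing import Pool

def brighten_row(row):
    # Each row can be processed independently of all other rows.
    return [min(255, pixel + 40) for pixel in row]

if __name__ == "__main__":
    image = [[(10 * r + c) % 256 for c in range(8)] for r in range(8)]
    with Pool() as pool:               # defaults to one worker per core
        result = pool.map(brighten_row, image)
    print(result[0])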

6. Enterprise and Mainframe Computing

(This section is adapted from the IBM Academic Initiative course on mainframe, and it is
used with permission)
Enterprise computing is the style of computing that satisfies the information processing need
of large enterprises.
• Very large amounts of data (e.g. transaction data in a stock exchange)
• High availability (i.e. the system almost never breaks down)
• Integrity and security (i.e. a guarantee that the data are safe and correct)
• Scalability (i.e. the system capacity can increase gracefully)
A mainframe is what businesses use to host their commercial databases, transaction servers,
and applications that require a greater degree of security and availability than is commonly
found on smaller-scale machines.

Strengths of Mainframes

The following table summarizes the major strengths of mainframes:

Strengths       Description
Reliability     Hardware provides self-checking and the ability to recover from errors.
                Software is extensively checked and tested.
Availability    Usually measured in Mean Time Between Failures (MTBF), which may be
                months or years for a modern mainframe. Able to continue operating while
                dealing with errors or scheduled upgrades.
Serviceability  Provides information about the source of a failure and allows a rapid
                problem fix.
Security        Provides a framework to manage authentication and prevent unauthorized
                access.
Scalability     Provides the flexibility to change capacity with minimal impact on
                operation and cost.
Continuing      Enterprises typically invest a lot of money in application development on
Compatibility   mainframes, and it is important that such applications continue to
                function even after decades.

Mainframes in the Modern World

Mainframe computers are usually hidden from the public eye. However, they are the driving
force behind many essential day-to-day activities.
• Many of the Fortune 1000 companies use a mainframe system.
• Over 60% of all data available on the Internet is stored on mainframe systems.
• There are at least 10,000 mainframe systems still running in the world.
• Most banks in Hong Kong are supported by mainframes.
The yearly revenue generated from mainframe computing is still between US$4 billion and
US$6 billion. There are 2,000 to 3,000 mainframe systems shipped every year.
IBM is the largest mainframe vendor and probably the only large vendor still in the market.
IBM has continuously enhanced mainframe computing with the most current technologies.
The modern mainframe computer is no longer a room-size computer system. It has now
included distributed computing, cloud computing and virtualization in its armoury.
The current IBM mainframe systems are called the System/Z series. The following lists the
core features of IBM zEnterprise System introduced in 2010:
• The processor z196 chip is a quad-core 5.2GHz CISC processor.
• The z196 system can support a maximum of 24 processors.
• Each core may be assigned a specific role such as a typical Central Processor or an
Application Assist Processor for running Java/XML.
• Maximum memory is 3TB.

Mainframe Architectures

Mainframe architecture is continuously evolving due to the emergence of new computing
technologies. There are, however, some architectural features that characterise a mainframe
system.

Architectural Features                                     Relevant to core values
More processors and faster processors                      Large amount of data processing
More memory                                                Large amount of data processing
Upgrading hardware and software dynamically                Scalability and availability
Enhanced IO capacity                                       Large amount of data processing
Flexible resource provisioning (i.e. dividing resources    Scalability
into multiple, logically independent systems)
Distributed computing capability                           Scalability and availability
Encryption of data (i.e. AES encryption)                   Security

The specifics of the architectural features are usually worked out rigorously from the Service
Level Agreement (SLA).
• An SLA is an agreement between a service provider and a recipient about the level of
performance required.
o For example, a bank may want 99% of ATM transactions to be completed in
one second.
• The number of processors, IO bandwidth, memory, etc. are worked out from the required
performance level (see the sketch below).
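A small Python sketch of checking such an SLA against measured latencies (the figures are
made up for illustration):

# "99% of ATM transactions completed in one second"
latencies = [0.4, 0.7, 0.9, 1.8, 0.5, 0.6, 0.8, 0.3, 0.7, 0.9]  # seconds
within = sum(t <= 1.0 for t in latencies) / len(latencies)
print(f"{within:.0%} within 1 s -> SLA {'met' if within >= 0.99 else 'missed'}")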
The following figure shows the conceptual structure of a traditional mainframe.
• Central processor contains processors, main memory, and other control circuitries.
• Large capacity data processing is enabled through a large number of channels, each of
which connects IO devices (such as hard-disks) to the memory storage.
• Processing capacity is scaled up through connecting IO devices to more Central
Processors. The Control Units manage the paths of data movement between IO devices,
channels, and other Control Units.
• More Central Processors can be connected when the processing capacity requirement is
increased.

The newer mainframe computers have more advanced features in IO connectivity,
configuration, and system partitioning.

Mainframe Management

The physical appearance of a mainframe computer is based on frames. Frames are places
where mainframe components called cages are fixed. There are two types of cages:
• One Central Electronic Complex (CEC) cage. Contains the processor units (PU), the
physical memory, and connectors to the other cages.
o The CEC cage contains one to four books, where processors and memory are
put together as a physical unit.
• One to three IO cages. Each contains connections to external IO devices.
The hardware configuration, system images, etc. are managed through a hardware
management console.

Processors

Mainframes are multi-processor systems. Each processor may be given a specific role and
perform specific work.
• Central Processor (CP): to support normal operating system and application execution.
• System Assistance Processor (SAP): provides a high-reliability and high-availability IO
subsystem. It manages multiple paths to control units and performs error recovery.
• Integrated Facility for Linux (IFL): a Central Processor that cannot support z/OS, the
operating system provided by IBM for System z. IBM charges less for this type
of processor.
• Integrated Coupling Facility (ICF): it couples together several z/OS based systems to
form a collaborative system.
• Spare: for use when there is a failure in other CPs, or to support Capacity Upgrade on
Demand (CUoD) when there is a sudden increase in processing need.

Disk Devices: Direct Access Storage Device (DASD)

Direct Access Storage Devices (DASD) are advanced versions of the typical hard disks used
in personal computers.
• DASDs are usually housed physically in a different location than the processors.
• A DASD has multiple disks arranged in a sophisticated manner for higher throughput and
higher reliability.

Virtualization and Partitioning

Virtualization is an important feature in mainframes, so that the massive computing resources
can be suitably provisioned to individual applications.
The idea is to give individual applications the illusion of running on their own hardware.
Applications can share the computing resources through partitioning the physical server into a
number of virtual servers.
IBM mainframes’ resources are managed by a hypervisor. The hypervisor is called a Control
Program (CP), in which users can create virtual servers with specific operating systems and
hardware provisioning.

• Logical partitioning (LPAR) involves the use of a hardware-based hypervisor (the
partitioning firmware) to separate the operating system from the CPUs.
• Virtualization aims to achieve these four objectives: resource sharing, resource
aggregation, emulation of function, and insulation.

