
MODULE: COMPUTER ARCHITECTURE

BSc_CA_621

Registered with the Department of Higher Education as a Private Higher Education Institution under the Higher Education Act,
1997.
Registration Certificate No. 2000/HE07/008

DEPARTMENT OF MEDIA INFORMATION AND


COMMUNICATION TECHNOLOGY
QUALIFICATION TITLE:
Bachelor Of Science In Information Technology
(Systems Engineering And Systems Development Specialisms)

LEARNER GUIDE
MODULES: Computer Architecture [1st Semester]
PREPARED ON BEHALF OF

PC TRAINING & BUSINESS COLLEGE (PTY) LTD


AUTHOR: VLZ MASHAKADA
EDITOR: ZIVANAI TARUVINGA
DEPARTMENT HEAD: Prof Rosh Maharaj
Copyright 2014
PC Training & Business College (Pty) Ltd
Registration Number: 2000/000757/07
All rights reserved; no part of this publication may be reproduced in any form or by any
means, including photocopying machines, without the written permission of the Institution.


TABLE OF CONTENTS

TOPICS (Page No.)

SECTION A: PREFACE (3-11)

1. Welcome
2. Title of Modules
3. Purpose of Module
4. Learning Outcomes
5. Method of Study
6. Lectures and Tutorials
7. Notices
8. Prescribed & Recommended Material (5-6)
9. Assessment & Key Concepts in Assignments and Examinations (6-8)
10. Specimen Assignment Cover Sheet
11. Work Readiness Programme (10)
12. Work Integrated Learning (WIL) (11)

SECTION B: Computer Architecture [1st Semester] (12-105)

1. Introduction To Computer Organization And Architecture
2. Computer Evolution And Performance
3. Instruction Set Architecture
4. Computer Arithmetic And Number Systems
5. Memory-Cache Hierarchy
6. Operating Systems Support


SECTION A: PREFACE
1. WELCOME
Welcome to the Department of Media Information and Communication
Technology at PC Training and Business College. We trust you will find the
contents and learning outcomes of this module both interesting and insightful as
you begin your academic journey and eventually your career in the Information
Technology realm.
This section of the study guide is intended to orientate you to the module before
the commencement of formal lectures.
The following lectures will focus on the study units described below:

SECTION A: WELCOME & ORIENTATION

Study unit 1: Orientation Programme (Lecture 1)
Introduction of academic staff to the learners by the Academic Programme
Manager; introduction of Institution policies.

Study unit 2: Orientation of Learners to Library and Student Facilities (Lecture 2)
Introducing learners to the physical infrastructure.

Study unit 3: Distribution and Orientation of Computer Architecture
Learner Guides, Textbooks and Prescribed Materials (Lecture 3)

Study unit 4: Discussion on the Objectives and Outcomes of the
Computer Architecture Module (Lecture 4)

Study unit 5: Orientation and guidelines to completing Assignments (Lecture 5)
Review and recap of Study units 1-4, and special guidelines for late
registrations.


2. TITLE OF MODULES, COURSE, CODE, NQF LEVEL, CREDITS & MODE OF DELIVERY

1st Semester

Course: BSc Computer Science
Title of Module: Computer Architecture
Code: CA
NQF Level: 6
Credits: 10
Mode of Delivery: Contact

3. PURPOSE OF MODULE
3.1 Computer Architecture
This course provides an overview of the organization and architecture of
general purpose computers. Learners will be exposed to several new aspects of
programming including hardware, embedded and register programming. A
variety of illustrative examples will be taken from across various Computer
Architecture platforms in order to entrench the required body of knowledge that
the syllabus seeks to establish at this level. The syllabus has been designed to
ensure that each and every chapter has a correlation with the next.
4. LEARNING OUTCOMES
On completion of this module, learners should be able to:
Define and explain Computer Architecture and Organization concepts
including functional components and their characteristics, performance
and the detailed interactions in computer systems including the system
bus, different types of memory, input/output as well as the CPU.
Employ Computer Architecture theory to solve basic functional
hardware programming and organizational problems.
Assemble basic computer components.
Select and use the appropriate hardware and software tools for systems
integration.
5. METHOD OF STUDY
Only the key sections that have to be studied are indicated under each topic in
this study guide, and learners are expected to have a thorough working
knowledge of the specified/referenced sections of the prescribed text book.
These form the basis for tests, assignments and examinations. To be able to
complete the activities and assignments for this module, and to achieve the


learning outcomes and ultimately to be successful in the tests and final


examination, you will need an in-depth understanding of the content of the
sections specified in the learning guide and the prescribed books. In order to
master the learning material, you must accept responsibility for your own
studies. Learning is not the same as memorizing. You are expected to show that
you understand and are able to practically apply the knowledge acquired. Use
will also be made of lectures, tutorials, case studies and group discussions to
present this module where necessary.
6. LECTURES AND TUTORIALS
Learners must refer to the notice boards on their respective campuses for details
of the lecture and tutorial time tables. The lecturer assigned to the module will
also inform you of the number of lecture periods and tutorials allocated to a
particular module. Prior preparation is required for each lecture and tutorial.
Learners are encouraged to actively participate in lectures and tutorials in order
to ensure success in tests, assignments and examinations.
7. NOTICES
All information pertaining to these modules such as tests dates, lecture and
tutorial time tables, assignments, examinations, etc. will be displayed on the
notice board located on your campus. Learners must check the notice board on
a daily basis. Should you require any clarity, please consult your lecturer,
programme manager, or administrator at your respective campus.
8. PRESCRIBED & RECOMMENDED MATERIAL
8.1. Prescribed Material:

8.1.1. Stallings, W., (2013), Computer Organization And Architecture, 8e.
Pearson Education, NJ.

The purchase of prescribed books is on the learner's own account and is
compulsory for all learners. This guide will have limited value if not
accompanied by the prescribed text book(s).
8.2. Recommended Materials
8.2.1. Englander, I., (2010), The Architecture Of Computer Hardware, Systems
Software & Networking, 4e. John Wiley & Sons, Asia.

NB: Learners please note that there will be a limited number of copies of the
recommended texts and reference material that will be made available at your
campus library. Learners are advised to make copies or take notes of the
relevant information, as the content matter is examinable.


8.3. Independent Research
The student is encouraged to undertake independent research.

8.4. Library Infrastructure
The following services are available to you:

8.4.1. Each campus keeps a limited quantity of the recommended reading titles
and a larger variety of similar titles which you may borrow. Please note that
learners are required to purchase the prescribed materials.
8.4.2.
Arrangements have been made with municipal, state and other
libraries to stock our recommended reading and similar titles. You may use
these on their premises or borrow them if available. It is your responsibility to
keep all library books safe.
8.4.3. PCT&BC has also allocated one library period per week so as to assist
you with your formal research under professional supervision.
8.4.4. The computer laboratories, when not in use for academic purposes, may
also be used for research purposes. Booking is essential for all electronic library
usage.
9. ASSESSMENT
Final Assessment for this module will comprise two Continuous Assessment
tests, an assignment and an examination. Your lecturer will inform you of the
dates, times and the venues for each of these. You may also refer to the notice
board on your campus or the Academic Calendar which is displayed in all
lecture rooms.
9.1. Continuous Assessment Tests
There are two compulsory tests for this module.
9.2. Assignment
There is one compulsory assignment for this module. Your lecturer will inform
you of the Assessment questions at the commencement of this module.
9.3. Examination
There is one two hour examination for each module. Make sure that you diarize
the correct date, time and venue. The examinations department will notify you
of your results once all administrative matters are cleared and fees are paid up.
The examination may consist of multiple choice questions, short questions and
essay type questions. This requires you to be thoroughly prepared as all the
content matter of lectures, tutorials, all references to the prescribed text and any


other additional documentation/reference materials is examinable in both your


tests and the examinations.
COMPETENCE AND SKILLS DEMONSTRATED

Knowledge
Skills: observation and recall of information; knowledge of dates, events,
places; knowledge of major ideas; mastery of subject matter.
Question cues: list, define, tell, describe, identify, show, label, collect,
examine, tabulate, quote, name, who, when, where, etc.

Comprehension
Skills: understanding information; grasping meaning; translating knowledge
into a new context; interpreting facts, comparing, contrasting; ordering,
grouping, inferring causes; predicting consequences.
Question cues: summarize, describe, interpret, contrast, predict, associate,
distinguish, estimate, differentiate, discuss, extend

Application
Skills: using information; using methods, concepts and theories in new
situations; solving problems using required skills or knowledge.
Question cues: apply, demonstrate, calculate, complete, illustrate, show,
solve, examine, modify, relate, change, classify, experiment, discover

Analysis
Skills: seeing patterns; organization of parts; recognition of hidden
meanings; identification of components.
Question cues: analyze, separate, order, explain, connect, classify, arrange,
divide, compare, select, infer

Synthesis
Skills: using old ideas to create new ones; generalizing from given facts;
relating knowledge from several areas; predicting, drawing conclusions.
Question cues: combine, integrate, modify, rearrange, substitute, plan,
create, design, invent, what if?, compose, formulate, prepare, generalize,
rewrite

Evaluation
Skills: comparing and discriminating between ideas; assessing the value of
theories and presentations; making choices based on reasoned argument;
verifying the value of evidence; recognizing subjectivity.
Question cues: assess, decide, rank, grade, test, measure, recommend,
convince, select, judge, explain, discriminate, support, conclude, compare,
summarize

The examination department will make available to you the details of the
examination (date, time and venue) in due course. You must be seated in the
examination room 15 minutes before the commencement of the examination. If
you arrive late, you will not be allowed any extra time. Your learner registration
card must be in your possession at all times.
9.4. Final Assessment
The final assessment for this module will be weighted as follows:

Continuous Assessment Test 1
Continuous Assessment Test 2
Assignment 1
Total Continuous Assessment: 40%
Semester Examination: 60%
Total: 100%
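As a sketch of how these weights combine into a final mark (assuming, since the guide does not state the internal split, that the continuous assessment mark is a plain average of the two tests and the assignment):

```python
def final_mark(test1, test2, assignment, exam):
    """Weighted final mark: 40% continuous assessment, 60% examination.

    Assumption (not stated in the guide): the continuous assessment
    mark is the plain average of the two tests and the assignment.
    """
    continuous = (test1 + test2 + assignment) / 3
    return 0.4 * continuous + 0.6 * exam

# Tests of 60 and 70 plus an assignment of 65 average to 65;
# 0.4 * 65 + 0.6 * 70 = 68
print(round(final_mark(60, 70, 65, 70), 1))
```

A learner who is weak in continuous assessment therefore still carries most of the weight (60%) into the examination.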
9.5. Key Concepts in Assignments and Examinations
In assignment and examination questions you will notice certain key concepts
(i.e. words/verbs) which tell you what is expected of you. For example, you
may be asked in a question to list, describe, illustrate, demonstrate, compare,
construct, relate, criticize, recommend or design particular information / aspects
/ factors /situations. To help you to know exactly what these key concepts or
verbs mean so that you will know exactly what is expected of you, we present
the following taxonomy by Bloom, explaining the concepts and stating the level
of cognitive thinking that these refer to.


10. SPECIMEN ASSIGNMENT COVER SHEET

PC Training and Business College
BSc In Information Technology
Computer Architecture (1st Semester)

Assignment Cover Sheet (to be attached to all Assignments, hand written or typed)

Name of Learner: ..........    Student No: ..........
Module: ..........             Date: ..........
ICAS Number: ..........        ID Number: ..........

The purpose of an assignment is to ensure that one is able to:
Interpret, convert and evaluate text.
Have a sound understanding of key fields, viz. principles and theories, rules,
concepts and an awareness of cognate areas.
Solve unfamiliar problems using correct procedures and corrective actions.
Investigate and critically analyze information and report thereon.
Present information using Information Technology.
Present and communicate information reliably and coherently.
Develop information retrieval skills.
Use methods of enquiry and research in a disciplined field.

ASSESSMENT CRITERIA
(NB: The allocation of marks below may not apply to certain modules like EUC and Accounting)

A. Content: Relevance
Columns: Question Number (1-10) / Mark Allocation / Examiner's Mark / Moderator's Marks / Remarks
Sub Total: 70 Marks

B. Research (a minimum of TEN SOURCES is recommended)
Library, EBSCO, Emerald Journals, Internet, Newspapers, Journals, Text Books;
Harvard method of referencing
Sub Total: 15 Marks

C. Presentation
Introduction, Body, Conclusion, Paragraphs, Neatness, Integration, Grammar /
Spelling, Margins on every page, Page Numbering, Diagrams, Tables, Graphs,
Bibliography
Sub Total: 15 Marks

Grand Total: 100 Marks

NB: All Assignments are compulsory as they form part of the continuous
assessment that goes towards the final mark.


11. WORK READINESS PROGRAMME (WRP)

In order to prepare learners for the world of work, a series of interventions
over and above the formal curriculum is concurrently implemented.
These include:
Soft skills
Employment skills
Life skills
End User Computing (if not included in your curriculum)

The illustration below outlines some of the key concepts for Work Readiness
that will be included in your timetable.

[Figure: the Work Readiness Programme, made up of four skill groups]

LIFE SKILLS: Manage Personal Finance; Driving Skills; Basic Life Support &
First Aid; Entrepreneurial Skills; Counselling Skills

SOFT SKILLS: Time Management; Working in Teams; Problem Solving Skills;
Attitude & Goal Setting; Etiquette & Ethics; Communication Skills

EMPLOYMENT SKILLS: CV Writing; Interview Skills; Presentation Skills;
Employer / Employee Relationship

END USER COMPUTING: Email & E-Commerce; Spreadsheets; Database; Presentation;
Office Word
It is in your interest to attend these workshops, complete the Work Readiness


Log Book and prepare for the Working World.
12. WORK INTEGRATED LEARNING (WIL)
Work Integrated Learning forms a core component of the curriculum for the
completion of this programme. All modules making up the Diploma/BSc in
Information Technology will be assessed in an integrated manner towards the
end of the programme or after completion of all other modules.


Prerequisites for placement with employers will include:

Completion of all tests & assignments
Success in examinations
Payment of all fee arrears
Return of library books, etc.
Completion of the Work Readiness Programme.
Learners will be fully inducted on the Work Integrated Learning Module, the
Workbooks & assessment requirements before placement with employers.

Good luck with your studies


Prof. Rosh Maharaj
Senior Director: Department of Information and Communication
Technology


SECTION B


DEPARTMENT OF MEDIA INFORMATION AND COMMUNICATION TECHNOLOGY

Bachelor Of Science In Information Technology
[Systems Engineering And Systems Development Specialisms, 2nd Year]

LEARNER GUIDE
MODULE: Computer Architecture (1st Semester)

TOPIC 1: INTRODUCTION TO COMPUTER ARCHITECTURE AND ORGANIZATION
TOPIC 2: COMPUTER EVOLUTION AND PERFORMANCE
TOPIC 3: INSTRUCTION SET ARCHITECTURE
TOPIC 4: COMPUTER ARITHMETIC AND NUMBER SYSTEMS
TOPIC 5: THE MEMORY/CACHE HIERARCHY
TOPIC 6: OPERATING SYSTEM SUPPORT


Lecture Plan (section page numbers in brackets; lecture allocations per unit)

Unit 1: An Introduction To Computer Architecture And Organization (16) [Lecture 6]
1.1 Computer Architecture & Organization (16)
1.2 Basic Structure And Function (17)

Unit 2: Computer Evolution And Performance (21) [Lectures 7-16]
2.1 The First Generation: Vacuum Tubes (21)
2.2 The Second Generation: Transistors (25)
2.3 The Third Generation: Integrated Circuits (25)
2.4 Later Generations (26)
2.5 Semiconductor Memory (26)
2.6 Microprocessors (27)
2.7 Design & Performance (29)
2.8 Microprocessor Speed (29)
2.9 The Evolution Of The Intel x86 Architecture (30)
2.10 Embedded Systems And The ARM (32)
2.11 ARM Evolution (32)

Unit 3: Instruction Set Architecture (35) [Lectures 17-21]
3.1 Addressing Modes (35)
3.2 Addressing Formats (37)
3.3 Computer System Performance & Performance Metrics (38)
3.4 Control Unit Operations (41)
3.5 The Fetch Cycle (43)
3.6 The Execute Cycle (43)

Unit 4: Computer Arithmetic And Number Systems (46) [Lectures 22-31]
4.1 Why Binary (46)
4.2 Converting A Binary Number To Decimal (46)
4.3 Converting A Decimal Number To Binary (47)
4.4 Binary Addition (49)
4.5 Binary Subtraction (50)
4.6 Binary Multiplication (51)
4.7 Binary Division (51)
4.8 Digital Logic (53)
4.9 Boolean Algebra (56)
4.10 Building Logic Circuits From Boolean Expressions (57)
4.11 Assembly Language, Assemblers & Compilers (60)

Unit 5: The Memory-Cache Hierarchy (62) [Lectures 32-43]
5.1 Basic Model (63)
5.2 Cache Architecture (64)
5.3 Write Policy (66)
5.4 Cache Components (67)
5.5 Cache Organization (68)
5.6 Fully Associative Cache Organization (68)
5.7 Direct Mapping Cache Organization (69)
5.8 Set Associative Cache Organization (69)
5.9 The Pentium Processor (70)
5.10 Pentium Cache Organization (71)
5.11 Operating Modes (71)
5.12 Cache Consistency (71)
5.13 Internal Memory (72)
5.14 DRAM And SRAM (72)
5.15 Types Of ROM (73)
5.16 External Memory (75)
5.17 Disk Characteristics (75)
5.18 RAID Technology (75)
5.19 Optical Disks (76)
5.20 Magnetic Tape (76)
5.21 Write-Once, Read-Many Disks (76)
5.22 Input / Output (77)
5.23 Buses (77)
5.24 Programmed I/O (78)
5.25 Interrupt-Driven I/O (78)
5.26 Direct Memory Access (80)

Unit 6: Operating Systems Support (83) [Lectures 44-50]
6.1 Pipeline Performance And Limitations (85)
6.2 Pipeline Limitations (85)
6.3 Reduced Instruction Set Computers (87)
6.4 Instruction Execution Characteristics (87)
6.5 Large Register Files (88)
6.6 A Look At CISC (89)
6.7 Characteristics Of RISC Architectures (89)
6.8 RISC Pipelining (90)
6.9 Organization Of The Pipeline (90)
6.10 Super-pipelining (91)
6.11 Instruction Level Parallelism And Superscalar Processors (91)


1. An Introduction To Computer Organization And Architecture


Section Objectives
At the end of this section, the learner should be able to:

Distinguish between Computer Architecture and Computer Organization


Define the key terms in Computer Architecture
Define Computer Function and Interconnection
Identify the main components of the CPU and their functions
Deliberate on Function and Structure

1.1 Computer Architecture And Organization


Computer Architecture is a phrase that has long been thrown around in a great deal of
computer-speak, but in practice it is nearly impossible to discuss Computer Architecture apart
from Computer Organization. For all practical intents and purposes it is much easier to understand
Computer Architecture in conjunction with Computer Organization, though a thin line divides the
two.
Computer Architecture refers to those aspects of a computer system that have a direct bearing on
the logical execution of a single task or a set thereof, which we then collectively refer to as a
program. These aspects when put together are normally referred to as the [hardware] specification
of the computer because they provide the user with information relating to various aspects of the
computer such as the instruction set, the number of bits used to represent various data types [e.g.
numbers, characters, floating data points], the word size, I/O mechanisms and the memory
addressing techniques specific to that particular computer. So then, Computer Architecture does not
refer to the physical, tangible hardware configuration or arrangement pertaining to a computer
system, but instead reflects specific performance-level information relating to the hardware
components making up that computer system.
Computer Organization on the other hand refers to the operational units [or logic, better still] and
their interconnections that harness, or make use of, or implement the architectural specification to
achieve intended informational or technological goals or both. Computer Organization specifies
how, and to what extent a specific aspect of the Computer Architecture will be implemented in a
task or a program in order to achieve set, known and expected goals. For example, it is an
architectural design issue whether or not a computer will have a Multiply instruction, but it is on
the other hand an organizational issue whether that instruction will be implemented by a single
specific Multiply operation, [X*4], or by a repeated Add mechanism which makes use of the
system's Add operation as many times as the quantity is to be multiplied, [X+X+X+X].
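The organizational choice in the Multiply example can be sketched in Python (a toy illustration, not any particular machine's microcode): one implementation relies on a dedicated multiply operation, the other reaches the same architectural result using only the add operation.

```python
def multiply_direct(x, n):
    """Organization A: the hardware provides a Multiply operation."""
    return x * n

def multiply_by_repeated_add(x, n):
    """Organization B: only Add exists; Multiply is built from it."""
    result = 0
    for _ in range(n):      # add x to the running total, n times
        result += x
    return result

# Same architectural behaviour, different organization:
print(multiply_direct(7, 4), multiply_by_repeated_add(7, 4))  # 28 28
```

Programs see identical results either way; what differs is cost. The repeated-add organization is cheaper to build but slower for large multipliers, which is exactly the power/performance/price trade-off described above.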
Having come this far, we can see that it is possible for a computer manufacturer to produce
a family of models of his brand of computer based on the same architecture but differing
in organization. These differences then account for the disparities that the
models exhibit in terms of power, performance and, consequently, price. It is also possible for a
specific architecture to span many years, with changes to the computer itself coming through
the organization avenue. Because the architecture upon which multiple models
are built remains the same for a foreseeable length of time, a customer who
bought a lower-specification computer two years ago can, for example, upgrade it to the current
state-of-the-art as his technological or resource needs shift.


1.2 Basic Structure And Function


A computer is a very complex system and the key to understanding such a complex system is to
acquire a sure grip on its hierarchical nature. A hierarchical system is a set of inter-related
subsystems organized in such a manner that governing protocols of information interchange, in the
case of a computer system, have to be observed as we move from the highest subsystem to the
lowest level subsystem. Communication rules or elements of protocol make it impossible or illegal
to skip intermediate levels of the organization or protocol while the system is performing
information interchange of any kind.
A hierarchical system at each level of its organization consists of a set of components and the interrelationships governing the components. At each level there are two important aspects that must
concern us, and these are:
Structure: The manner in which the components are inter-related
Function: The operation of each member component as part of the overall Structure.
In terms of describing a hierarchical system, two approaches may be used: starting at the bottom
and building up to a complete description, or beginning with a top view and decomposing the
system into its subcomponents. The latter is known as the Top-Down approach, and it is the
approach that this guide will adopt going forward unless otherwise specified. In this approach, we
begin by describing the structure and function of the major components of a computer, proceeding
successively to the lower levels of the hierarchy.

[Figure 1.1: A Functional View Of The Computer. The Operating Environment
(the source and destination of data) connects to the Data Movement Apparatus,
which exchanges data with the Data Storage Facility and the Data Processing
Facility, all coordinated by the Control Mechanism.]


Both the structure and function of a computer system are very simple, and Figure 1.1 shows in a
very simplified way that the computer performs four basic functions:
Data processing
Data storage
Data Movement, and
Control.
Essentially, a computer must process data. Data may be processed for many reasons and take a
wide variety of forms, from employee details to student details, for example, which must be
processed to produce correspondingly ordered records. These records, which at this stage reflect
useful, consumable information, are then stored. The computer system implements storage in
either of two ways. While the computer is still processing data, some of the intermediate results
which are going to be needed as input in subsequent processing stages will be stored in a
short-term data storage facility called Random Access Memory [RAM]. This type of memory is
temporary, and whatever is in RAM at the point the system shuts down will be irretrievably lost.
For this reason, RAM is said to be volatile. The other very important aspect of storage arises when
the results of data processing [information] are stored permanently on the computer or on external
devices such as flash drives, DVDs or any of the many available online storage services. Most
organizations use online storage services as an outsourced backup alternative.
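The volatile/persistent distinction above can be sketched in a few lines (a minimal software analogy: a variable stands in for RAM, a file for permanent storage; the file name is hypothetical):

```python
import os
import tempfile

# Data held only in a variable models RAM: it vanishes when the
# process (the "power") goes away.
intermediate_result = [1, 2, 3]     # volatile: lost on shutdown

# Writing to disk models permanent storage: it survives restarts.
path = os.path.join(tempfile.gettempdir(), "results.txt")  # hypothetical file
with open(path, "w") as f:
    f.write(",".join(str(n) for n in intermediate_result))

# A later run (or another program) can read the stored results back:
with open(path) as f:
    recovered = [int(n) for n in f.read().split(",")]
print(recovered)  # [1, 2, 3]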
The computer must have the capacity to move data between itself and the external world. The
computers operational environment consists of devices which serve either as sources or
destinations of data. When data is received from or delivered to a device that is directly connected
to a computer, the process is known as input/output [I/O] and the device so involved is known as a
peripheral. When data are moved over long distances to and from a remote device, the process is
known as Data Communications.
Ultimately, there must be Control of all these three functions of a computer system and this control
within the computer is provided by a component called the Control Unit which manages the
computers resources and orchestrates the performance of the functional parts of a computer
system in response to user instructions which in many cases are in the form of some program.
Now moving onto the Structure of a computer system, Figure 1.2 reveals a very simplified portrayal
of the computer system. All computers will invariably interact with the external environment and
they generally achieve this through peripherals and linkages which we shall term communication
lines.

[Figure 1.2: The Computer. The computer, comprising storage and processing,
interacts with the external environment through peripherals and over
communication lines.]


Though we have this depiction of the Computer System in the above figure, of greater importance to
us is the internal structure itself, of which the CPU is shown in Figure 1.3 below. There are four main
structural components in the computer, viz:
The Central Processing Unit [CPU]: This extremely important component is interchangeably
referred to as the processor, and is the heart of the system, controlling all
operations in the performance of data processing functions.
Main Memory: Which pretty much is the RAM, stores [intermediate] data during processing
as earlier alluded to.
I/O: These functionaries move data between the computer and its external environment.
System Interconnection: This refers to the mechanism which provides communication
among the CPU, Main Memory and I/O. A common example of system interconnection is by
means of a System Bus consisting of a number of conducting wires to which all other
components attach.

[Figure 1.3: Internal Structure Of The CPU. The CPU, comprising the
Arithmetic Logic Unit, the Registers and a Control Unit (which itself contains
Sequencing Logic and Control Memory), connects to Main Memory and the I/O
components.]


Traditionally there has been only one Central Processing Unit in any one computer, but in recent
years the trend has shifted toward the use of multiple processors in a single computer, especially in
bigger machines which emphasize the need for raw power. Medium-sized servers, which are now
commonplace in many server rooms, are examples of such machines, in particular Hewlett
Packard's ML300 series.


Considering the four components of the computer system, the most complex and the most
interesting is the CPU whose major components are:
The Control Unit [CU]: Which controls the operation of the CPU and hence the computer as
a whole
The Arithmetic Logic Unit [ALU]: It performs the computers data processing functions
Registers: These provide intermediate storage which is internal to the CPU
CPU Interconnection: It exists in the form of a CPU internal bus (represented by block
arrows in Figure 1.3) and provides communication between the Control Unit, the ALU and
the registers.
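The interplay of these four CPU components can be sketched as a toy fetch-decode-execute loop (purely illustrative: the instruction format and register names are invented for this example, and Python function calls stand in for the CPU internal bus):

```python
def alu(op, a, b):
    """Arithmetic Logic Unit: performs the data processing proper."""
    return {"ADD": a + b, "SUB": a - b, "MUL": a * b}[op]

def run(program):
    """Control Unit: fetches each instruction in turn, decodes it, and
    routes values between the registers and the ALU over the (here
    implicit) CPU interconnection."""
    registers = {"R0": 0, "R1": 0, "R2": 0}   # intermediate storage in the CPU
    for op, dest, src1, src2 in program:      # fetch and decode
        if op == "LOAD":                      # place a constant in a register
            registers[dest] = src1
        else:                                 # hand the operands to the ALU
            registers[dest] = alu(op, registers[src1], registers[src2])
    return registers

program = [("LOAD", "R0", 6, None),
           ("LOAD", "R1", 7, None),
           ("MUL",  "R2", "R0", "R1")]
print(run(program)["R2"])  # 42
```

Notice how the division of labour mirrors the list above: the loop is the Control Unit, the `alu` function is the ALU, the dictionary plays the registers, and the calls between them stand in for the CPU interconnection.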
There it is! We have come to the end of this first part of our module. Proceed to answer all of the
following painless questions and after satisfying yourself that you have adequately answered the
questions, you may turn to the next section of the guide.

Think Point: Before Attempting the Self-Assessment Exercise below, take time to go
through the prescribed book and read pages 26-32:
Stallings, W., (2013), Computer Organization And Architecture, 8e. Pearson Education, NJ.
Pay particular attention to the distinction that the text makes between Computer Architecture and
Computer Organization.
Try other similar texts and compare how the following terms are defined and explained:
Computer Architecture
Computer Organization
CPU Structure
CPU Function

Self-Assessment Exercise
1.0 What in general terms is the distinction between Computer Organization and Computer
Architecture?
1.1 Explain the relationship between Computer Organization and Computer Architecture across
different models from the same manufacturer or vendor.
1.2 What in general terms is the distinction between computer structure and computer function?
1.3 What are the four main functions of a computer?
1.4 List and in each case briefly define the main structural components of a computer.
1.5 List and briefly define the main structural components of a CPU.


2. Computer Evolution And Performance


Section Outcomes
At the end of this section, the learner should be able to:

Relate the evolution of computers through all generations to the current state-of-the-art
Gain an understanding of the basic and more advanced terms in Information Technology
Understand and answer questions relating to instruction addressing and execution
Clearly relate to major microprocessor design techniques and how they are implemented
Describe basic microprocessor performance and performance metrics

At this point, the swimming toward the deep end begins. We begin by looking very briefly at
computer evolution and this is a subject which you have no doubt encountered already in the very
formative stages of your Information Technology studies. We look at the same time at how
computers performed throughout the evolution up until now, and note the similarities and
differences.
From the very first known computer up until now, the evolution of computers has been
characterized by increasing processor speeds, reduction in component size, increasing memory size
and increasing I/O capacity and speed. The great increase in processor speed has been made
possible by the reduction in the size of processor components, which has cut down the distances
between components, resulting in higher processing speeds. Modern processor organization has
also contributed to the current gains in processor speed; a common organization technique is
speculative execution, in which the processor anticipates instructions that might be needed in the
near future and processes them beforehand.
We now look at this history through the technological events and milestones which separated the
different generations that marked the evolution of computers up to the current state-of-the-art.
2.1 The First Generation: Vacuum Tubes
The ENIAC [Electronic Numerical Integrator And Computer] was the first general-purpose digital
electronic computer. It was designed and constructed at the University of Pennsylvania by John
Mauchly and John Eckert as a direct response to the requirements of the US Army in World War 2.
The project began in 1943, but by the time the machine was completed in 1946 the wartime target
had already been missed. The machine was huge, weighing more than 30 tonnes, containing 18,000
vacuum tubes and occupying about 140 square metres of floor space. It consumed an enormous
amount of power but compensated for this by offering processing speeds far beyond those of any
electromechanical computer of the time.
The ENIAC was a decimal [and not a binary] machine. Memory was made up of 20 accumulators
each of which was capable of holding a 10-digit number. A ring of 10 vacuum tubes represented
each digit and at any time only one of the 10 vacuum tubes was in an ON state representing one of
the ten digits. A major drawback of the ENIAC was that it had to be programmed manually, by
setting switches and plugging and unplugging cables.
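The ring-of-10 representation can be sketched in a few lines of code to show how it works and why it is wasteful compared with binary. This is a hypothetical illustration only; the function names are invented and do not describe ENIAC's actual circuitry.

```python
# ENIAC-style decimal storage: each digit is a "ring" of 10 tubes,
# exactly one of which is ON (a one-hot code). Ten tubes therefore
# encode only 10 values, although the same 10 tubes used as binary
# digits could encode 2**10 = 1024 values.

def encode_digit(d):
    """Return the one-hot tube pattern for a decimal digit 0-9."""
    if not 0 <= d <= 9:
        raise ValueError("decimal digit expected")
    return [1 if i == d else 0 for i in range(10)]

def decode_digit(tubes):
    """Recover the digit from a one-hot pattern of 10 tubes."""
    assert sum(tubes) == 1, "exactly one tube must be ON"
    return tubes.index(1)

one_hot_values = 10        # distinct values per ring of 10 tubes
binary_values = 2 ** 10    # values the same 10 tubes could encode in binary

print(encode_digit(7))                 # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(one_hot_values, binary_values)   # 10 1024
```

This contrast is exactly why the stored-program machines that followed switched to binary representation.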
2.1.1 The von Neumann Machine
The von Neumann machine was designed by John von Neumann, a mathematician who also served
on the ENIAC project as a consultant. It was motivated by the tedium associated with manually
entering and altering programs on the ENIAC. A suitably automated programming process could be
facilitated if the program could be represented in a form

that made it suitable to store the program in memory alongside the data. This would make it
possible for the computer to get its instructions by reading them from memory and a program could
be set or altered by setting the values of certain portions of memory. This is what we now know as
the stored program concept and is attributable to John von Neumann and also to a certain extent,
the team that did the job on the ENIAC.
The first publication of the stored program concept was a 1945 proposal by von Neumann for a
new computer, the EDVAC [Electronic Discrete Variable Automatic Computer]. In 1946, von
Neumann and his colleagues began the design of a stored program computer, referred to as the IAS
computer, which was completed in 1952 and became the prototype of all subsequent general-purpose
computers. The general structure of the IAS is depicted in Figure 1.3 in Chapter 1, which
highlights the general structure of the CPU.
The IAS structure consists of:
- A main memory, which stores both data and instructions
- An Arithmetic Logic Unit [ALU] capable of operating on binary data
- A Control Unit [CU], which interprets the instructions in memory and causes them to be
executed
- I/O equipment operated by the CU.
Von Neumann's proposal of 1945, which resulted in the building of the von Neumann machine,
contained a broad specification of all the components which constitute the structure of the
computer, with emphasis laid on the internal structure of the CPU. With rare exceptions,
all computers as we know them today are based on von Neumann's proposal, and they are
therefore all referred to as von Neumann machines.
The memory of the IAS contains 1000 storage locations called words of 40 binary digits each. Both
data and instructions are stored there. Numbers are represented in the IAS in binary form and
instructions are in the form of a binary code. Each number is represented by a sign bit and a 39-bit
value. A word may also contain two 20-bit instructions, with each instruction consisting of an 8-bit
opcode specifying the operation to be performed and a 12-bit address designating one of the words
in memory. The CU operates the IAS by fetching the instructions from memory and executing them
one at a time.

Figure 2.1 IAS Memory Formats
(a) Number word: a sign bit (bit 0) followed by a 39-bit value (bits 1-39).
(b) Instruction word: two 20-bit instructions per 40-bit word. The left instruction occupies
bits 0-19 (8-bit opcode in bits 0-7, 12-bit address in bits 8-19); the right instruction
occupies bits 20-39 (opcode in bits 20-27, address in bits 28-39).
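The instruction-word format can be unpacked with a few shifts and masks. The sketch below assumes the bit layout of Figure 2.1(b); the helper names are invented for illustration.

```python
def split_ias_word(word):
    """Split a 40-bit IAS word into (left, right) 20-bit instructions,
    each returned as an (opcode, address) pair. Bit 0 is the leftmost
    bit: opcode = 8 bits, address = 12 bits."""
    assert 0 <= word < 1 << 40
    left = word >> 20                  # bits 0-19
    right = word & 0xFFFFF             # bits 20-39

    def decode(instr):
        return instr >> 12, instr & 0xFFF   # 8-bit opcode, 12-bit address

    return decode(left), decode(right)

# Build a word holding: left instruction (opcode 1, address 0x123) and
# right instruction (opcode 2, address 0x456).
word = (0x01 << 32) | (0x123 << 20) | (0x02 << 12) | 0x456
print(split_ias_word(word))   # ((1, 291), (2, 1110)) i.e. addresses 0x123, 0x456
```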


Both the CU and the ALU contain storage locations called registers, defined as follows:
- Memory Buffer Register [MBR]: contains the word that is to be stored in memory or sent to
an I/O unit, or the word that has just been received from memory or from an I/O device
- Memory Address Register [MAR]: contains the address in memory of the word that is to be
written from, or read into, the MBR
- Instruction Register [IR]: contains the 8-bit opcode of the instruction currently being
executed
- Instruction Buffer Register [IBR]: temporarily holds the right-hand instruction from a word
in memory
- Program Counter [PC]: contains the address of the next instruction pair to be fetched from
memory
- Accumulator [AC] and Multiplier Quotient [MQ]: a pair of registers used to hold temporary
operands and intermediate results of ALU operations.
The IAS computer operates by repeatedly performing an instruction cycle, and each instruction
cycle consists of two subcycles: the fetch cycle and the execute cycle. During the fetch cycle, the
opcode of the next instruction is loaded into the IR and the address portion is loaded into the MAR.
This instruction may be taken from the IBR, or it can be obtained from main memory by loading a
word into the MBR and then down to the IBR, IR and MAR. Once the opcode is in the IR, the execute
cycle is initiated: control circuitry interprets the opcode and executes the instruction by sending
out the appropriate control signals to cause data to be moved or an operation to be performed by
the ALU.
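The fetch/execute alternation can be sketched with a toy interpreter. The two opcodes below are invented for illustration and are not the actual IAS instruction set.

```python
# A minimal sketch of the fetch/execute cycle, using a toy two-instruction
# machine: opcode 1 = "load AC from address", opcode 2 = "add memory word
# to AC". Both opcodes are hypothetical.

LOAD, ADD = 1, 2

def run(memory, program):
    """program: list of (opcode, address) pairs, executed in order."""
    ac = 0                                # accumulator
    pc = 0                                # program counter
    while pc < len(program):
        opcode, address = program[pc]     # fetch cycle: IR <- opcode, MAR <- address
        pc += 1                           # PC advances to the next instruction
        if opcode == LOAD:                # execute cycle: control circuitry
            ac = memory[address]          # interprets the opcode and acts on it
        elif opcode == ADD:
            ac += memory[address]
    return ac

memory = {10: 5, 11: 7}
print(run(memory, [(LOAD, 10), (ADD, 11)]))   # 12
```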

Figure 2.2 Expanded Structure Of The IAS Computer
[The figure shows the Arithmetic Logic Unit, containing the AC and MQ registers, the arithmetic
logic circuits and the MBR; the Program Control Unit, containing the IBR, PC, IR, MAR and the
control circuits; main memory; and the I/O equipment. Control signals run from the control
circuits to memory, the I/O equipment and the ALU.]

Reading/Activity: Read the section von Neumann Machine in your prescribed textbook on
page 36 and study Figure 2.2 above. It is normal to encounter a question that will require you to
reproduce it as well as answer questions which derive their answers from the representation of
Figure 2.2.
Activity: Indicate the width in bits of each data path (e.g. between the AC and ALU)

2.1.2 Commercial Computers.


The 1950s saw the birth of the computer industry, with two companies, Sperry and IBM, dominating
the marketplace. In 1947, Eckert and Mauchly had formed the Eckert-Mauchly Computer
Corporation to manufacture computers commercially. Their first successful machine was the UNIVAC
I [Universal Automatic Computer], which was successfully commissioned for use in the 1950 US
national census. It was intended for both commercial and scientific applications, and the UNIVAC
thus became the first successful commercial computer. Several variations of the UNIVAC were built
subsequently, becoming commercially successful themselves and exhibiting steady enhancements in
processing power. All these computers illustrated trends that have remained characteristic of the
computer industry in both design and innovation.
2.2 The Second Generation - Transistors
The first major change in electronic computers came with the replacement of the vacuum tube
by the transistor. The transistor is smaller, cheaper and dissipates much less heat than the vacuum
tube, yet it can be used in much the same way to construct computers. Unlike the vacuum tube,
which requires wires, metal plates, a glass capsule and a vacuum, the transistor is a
solid-state device made from silicon. Bell Laboratories invented the transistor in 1947.
The use of the transistor defines the second generation of computers and it has become widely
accepted to classify computers into generations according to the fundamental technology
introduced and employed at certain technologically revolutionary points. Each new generation is
characterized by greater processing performance, larger memory capacity and smaller size than the
previous one. The second generation also witnessed the introduction of complex ALUs and CUs, the
use of high level programming languages and the provision of system software with the computer.
The second generation introduced several differences from the fundamental IAS architecture and
the most important of these is the use of data channels. A data channel is a single I/O module with
its own processor and its own instruction set. In a computer with such devices, the CPU does not
execute detailed I/O instructions. Such instructions are stored in the main memory to be executed
by a special-purpose processor in the data channel itself. The CPU initiates I/O transfer by sending a
control signal to the data channel instructing it to execute a series of instructions currently resident
in memory. The data channel then performs the task independently of the CPU and then signals the
CPU when the task is complete. This arrangement relieves the CPU of extra processing burdens.
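The division of labour between CPU and data channel can be sketched as follows. The class and method names are illustrative only, not a real channel interface.

```python
# A hypothetical sketch of the data-channel idea: the CPU's only
# involvement is one "start" control signal; the channel then runs its
# own transfer independently and signals completion.

class DataChannel:
    """Copies `count` words starting at `addr` from memory to an output
    device, independently of the CPU (illustrative sketch only)."""

    def __init__(self, memory):
        self.memory = memory
        self.device_output = []
        self.done = False

    def start(self, addr, count):          # the CPU's single control signal
        for i in range(count):             # channel executes its own I/O work
            self.device_output.append(self.memory[addr + i])
        self.done = True                   # completion signal back to the CPU

memory = {100: 10, 101: 20, 102: 30}
channel = DataChannel(memory)
channel.start(100, 3)                      # CPU is now free to do other work
print(channel.device_output, channel.done)   # [10, 20, 30] True
```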
2.3 The Third Generation - Integrated Circuits
A single self-contained transistor is called a discrete component. Throughout the first and second
generations of computers, electronic equipment was composed mainly of discrete components:
transistors, resistors, capacitors and so on. Discrete components were manufactured separately,
packaged in their own containers, and then soldered or wired together onto Masonite-like circuit
boards, which were then installed in computers and other electronic equipment. The manufacturing
process from transistor to circuit board was extremely expensive and cumbersome.
Problems in the manufacture of computer equipment became visible as it grew impossible to
introduce newer, more powerful machines without increasing the number of transistors in the
machine. As these powerful machines were churned out, they contained tens of thousands of
transistors, a figure that sky-rocketed with newer innovations, and building these machines grew
increasingly difficult.
In 1958 however came the achievement that revolutionized electronics and started the era of
microelectronics. This is the invention of the integrated circuit. It is the integrated circuit that defines
the Third Generation of Computers. The integrated circuit exploits the fact that such components as
transistors, resistors and conductors can be fabricated from a semi-conductor such as silicon. It is
merely an extension of the solid-state art to fabricate an entire circuit in a tiny piece of silicon rather
than assemble discrete components made from separate pieces of silicon into the same circuit.
Many transistors can be produced at the same time on a single wafer of silicon. These transistors can
be connected by a process of metallization to form circuits. A single wafer of silicon is divided into
chips, and each chip contains gates [which implement logic rules such as "If A and B are TRUE,
then C is also TRUE"], memory cells, and a number of input and output attachment points. The chip
is packaged in a housing that protects it, and this housing provides pins for attachment to devices
beyond the chip itself.
Initially, only a few gates and memory cells could be reliably manufactured and packaged together.
These early integrated circuits are referred to as small-scale integration [SSI]. As time went on, it
became possible to pack more and more components into the same chip.
2.4 Later Generations
Beyond the third generation, there is less agreement on how to define the generations of
computers. Table 2.1 seems to suggest that there have been a number of later generations based on
advances in integrated circuit technology. With the introduction of large-scale integration [LSI],
more than 1,000 components can be placed on a single integrated circuit chip. Very Large-scale
integration [VLSI] achieved more than 10,000 components per chip while current ultra-large-scale
integration [ULSI] chips can contain more than one million components. With the rapid pace of
technology, the high rate of introduction of new products, and the importance of software and
communications as well as hardware, the classification becomes less clear and less meaningful.
Table 2.1 Computer Generations

Generation  Approximate Dates  Technology         Typical Speed [Operations/second]
1           1946-1957          Vacuum Tube        40,000
2           1958-1964          Transistor         200,000
3           1965-1971          SSI and Medium SI  1,000,000
4           1972-1977          LSI                10,000,000
5           1978-1991          VLSI               100,000,000
6           1991-              ULSI               1,000,000,000

2.5 Semiconductor Memory


The first application of integrated circuit technology to computers was in the construction of the
processor [the CU and the ALU]. It was soon found that the same technology could also be used to
construct memory. In the 1950s and 1960s, computer memory was constructed from tiny rings of
ferromagnetic material. These rings were strung up on grids of fine wires suspended on small
screens inside the computer. Magnetized one way, a ring, called a core, represented a 1; magnetized
the other way, it stood for a 0. Magnetic-core memory was rather fast: it took as little as a
millionth of a second to read a bit stored in memory. But it was extremely expensive and bulky, and
it employed a technique called destructive readout, meaning that once read, the data stored in a
core was immediately erased.
Semiconductor memory made its first appearance in 1970, when a chip about the size of a single
magnetic core could hold 256 bits of memory. It was non-destructive and much faster than core,
taking only 70 billionths of a second to read a bit, but the cost per bit was higher than that of core.
Developments continued in the semiconductor arena, and in 1974 the price per bit of semiconductor
memory finally dropped below that of core. Since then there has been a drastic, continuing decline
in the cost of semiconductor memory, coupled with increasing physical memory density. This has led
to much smaller but faster machines with memory sizes matching those of the larger, more
expensive behemoths of previous years. Since 1970, semiconductor memory has been through 13
generations: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M, 1G, 4G and, as of this writing,
16 Gbits on a single semiconductor memory chip. Each subsequent generation has achieved four
times the storage density of the previous one, accompanied by declining cost per bit and declining
access time.
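The quadrupling-per-generation claim can be checked with a line of arithmetic: twelve quadruplings from 1 Kbit do indeed give 16 Gbit at the thirteenth generation.

```python
# Assuming exactly a 4x density gain per generation, starting from
# 1 Kbit in 1970, generations 2 through 13 each quadruple the density.

KIBI = 2 ** 10

density = 1 * KIBI               # generation 1: 1 Kbit per chip
for generation in range(2, 14):  # generations 2 through 13
    density *= 4                 # each generation quadruples density

print(density // (KIBI ** 3))    # 16  -> 16 Gbit per chip at generation 13
```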
2.6 Microprocessors
Just as the density of elements on memory chips has continued to rise, so has the density on
processor chips. As time went on, more and more elements were packed onto each chip, so that
fewer and fewer chips were needed to construct a single computer processor. A breakthrough was
achieved in 1971, when Intel developed its 4004. The 4004 was the first chip to contain all the
elements of a CPU on a single chip: the microprocessor was born.
The 4004 could add two 4-bit numbers and could multiply only by repeated addition. By today's
standards the 4004 is hopelessly primitive, but it was the necessary starting point of a continuing
evolution in microprocessor capability and power.
This evolution can be seen most easily in the number of bits that the processor deals with at a time.
There is no clear measure of this, but perhaps the best measure is the data bus width: the number of
bits of data that can be brought into or sent out of the processor at any one time. Another measure
is the number of bits in the accumulator or in the set of general purpose registers. Often these
measures coincide, but not always. For example, a number of microprocessors were developed that
operate on 16-bit numbers but read and write only 8 bits at a time.
The next step in the development of the microprocessor was the introduction in 1972 of the Intel
8008, the first 8-bit microprocessor, which was twice as complex as the 4004. Neither of these
steps had the impact of the next major event: the introduction in 1974 of the Intel 8080,
the first general-purpose microprocessor. Whereas the 4004 and the 8008 had been designed for
specific applications, the 8080 was designed to be the CPU of a general-purpose computer. Like the
8008 it is an 8-bit microprocessor, but it is much faster and has a much richer instruction set and a
larger addressing capability. At the same time, 16-bit microprocessors were being developed, and at
the end of the 1970s they began to appear, Intel weighing in with its 8086. The next milestone in
these developments came when Bell Laboratories and Hewlett-Packard developed the first 32-bit
single-chip microprocessors. Intel introduced its own 32-bit microprocessor, the 80386, in 1985.

Table 2.2 Evolution Of Intel Microprocessors

(a) 1970s Processors
                     4004       8008     8080    8086          8088
Introduced           1971       1972     1974    1978          1979
Clock Speeds         108 kHz    108 kHz  2 MHz   Up to 10 MHz  5-8 MHz
Bus Width            4 bits     8 bits   8 bits  16 bits       8 bits
Transistors          2,300      3,500    6,000   29,000        29,000
Feature Size [µm]    10         10       6       3             6
Addressable Memory   640 bytes  16 KB    64 KB   1 MB          1 MB

(b) 1980s Processors
                     80286       386 DX     386 SX     486 DX
Introduced           1982        1985       1988       1989
Clock Speeds         6-12.5 MHz  16-33 MHz  16-33 MHz  25-50 MHz
Bus Width            16 bits     32 bits    16 bits    32 bits
Transistors          134,000     275,000    275,000    1.2 million
Feature Size [µm]    1.5         1          1          0.8-1
Addressable Memory   16 MB       4 GB       16 MB      4 GB
Virtual Memory       1 GB        64 TB      64 TB      64 TB
Cache                --          --         --         8 KB

(c) 1990s Processors
                     486 SX         Pentium      Pentium Pro   Pentium II
Introduced           1991           1993         1995          1997
Clock Speeds         16-33 MHz      60-166 MHz   150-200 MHz   200-300 MHz
Bus Width            32 bits        32 bits      64 bits       64 bits
Transistors          1.185 million  3.1 million  5.5 million   7.5 million
Feature Size [µm]    1              0.8          0.6           0.35
Addressable Memory   4 GB           4 GB         64 GB         64 GB
Virtual Memory       64 TB          64 TB        64 TB         64 TB
Cache                8 KB           8 KB         512 KB L1,    512 KB L2
                                                 1 MB L2

(d) Recent Processors
                     Pentium III  Pentium 4    Core 2 Duo    Core 2 Quad
Introduced           1999         2000         2006          2008
Clock Speeds         450-600 MHz  1.3-1.8 GHz  1.06-1.2 GHz  3 GHz
Bus Width            64 bits      64 bits      64 bits       64 bits
Transistors          9.5 million  42 million   167 million   820 million
Feature Size [nm]    250          180          65            45
Addressable Memory   64 GB        64 GB        64 GB         64 GB
Virtual Memory       64 TB        64 TB        64 TB         64 TB
Cache                512 KB L2    256 KB L2    2 MB L2       6 MB L2

2.7 Design And Performance


Year by year, the cost of computer systems continues to drop dramatically, while the performance
and capacity of those systems continues to rise equally dramatically. This continuing technological
revolution has enabled the development of applications of astounding power and complexity.
Workstation systems now support highly sophisticated engineering and scientific applications as well
as simulation systems while simultaneously having the ability to support image and video
applications. In addition, businesses are relying on increasingly powerful servers to handle
transactions and database processing and to support massive client-server networks that have
replaced the huge mainframe computer centers of the past. The fascinating observation in all of this
is that the basic building blocks of today's computer miracles are virtually the same as those of the
IAS computer of over 50 years ago, while the techniques for squeezing the last iota of performance
out of the materials at hand have become increasingly sophisticated. This observation helps to
highlight the distinction between Computer Architecture and Computer Organization.
2.8 Microprocessor Speed
While microprocessor power has raced ahead at breakneck speed, other critical components of the
computer have not kept up. The addition of new circuits, and the ability to pack all the necessary
components of the microprocessor onto one ULSI chip, has seen the speed of microprocessors
roughly quadruple every three years, while memory access speeds have improved far more slowly.
This differential means that the raw speed and power of the microprocessor will not achieve its
potential unless the processor is fed a constant stream of work to do in the form of computer
instructions. Anything that gets in the way of that smooth flow undermines the power of the
microprocessor. So while the chipmakers have been busy learning how to fabricate chips of greater
and greater density, microprocessor designers must come up with ever more elaborate techniques
for feeding the monster. Among the techniques built into the microprocessor are the following:

- Branch Prediction: The processor looks ahead in the instruction code fetched from memory
and predicts which branches, or groups of instructions, are likely to be processed next. If the
processor guesses right most of the time, it can pre-fetch the correct instructions and buffer
them so that it is kept busy rather than waiting while instructions are fetched on demand;
they are fetched before they are needed. Multiple branches can be predicted. Branch
prediction has the effect of increasing the amount of work available for the processor to
execute.
- Data Flow Analysis: The processor analyzes which instructions are dependent on each
other's results, or data, to create an optimized schedule of instructions; this prevents
unnecessary processing delays.
- Speculative Execution: Using branch prediction and data flow analysis, some processors
speculatively execute instructions ahead of their actual appearance in the program execution,
holding the results in temporary locations. This keeps the processing engine as busy as
possible by executing instructions that are likely to be needed well in advance of the time
they are actually required.
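As a concrete illustration of branch prediction, one classic mechanism is the 2-bit saturating counter: states 0-1 predict "not taken", states 2-3 predict "taken", and each actual outcome nudges the counter toward that outcome. The sketch below is generic and not tied to any particular processor discussed here.

```python
def predict_branches(outcomes, state=2):
    """Return the number of correct predictions for a sequence of actual
    branch outcomes (True = taken), using a 2-bit saturating counter."""
    correct = 0
    for taken in outcomes:
        prediction = state >= 2          # states 2-3 predict "taken"
        correct += (prediction == taken)
        # saturating update: move one step toward the actual outcome
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct

# A loop branch taken 9 times and then not taken once: the predictor
# is wrong only on the final iteration.
print(predict_branches([True] * 9 + [False]))   # 9
```

The 2-bit counter tolerates a single anomalous outcome (such as a loop exit) without immediately flipping its prediction, which is why it outperforms a 1-bit scheme on loop-heavy code.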

These techniques make it possible to exploit the sheer power and raw speed of the processor
effectively and efficiently. But as the speed of the processor skyrockets, another acute problem
arises: a speed differential is created between the processor and main memory. The interface
between main memory and the processor is the most crucial pathway in the computer, because it is
responsible for carrying a constant flow of program instructions and data between the memory chips
and the processor. If memory or the pathway fails to keep pace with the processor's insistent
demands, the processor stalls in a wait state; valuable processing time is lost and the processor's
raw power is severely undermined.
There are a number of ways in which a systems architect can attack this problem, all of which are
reflected in contemporary computer designs. Consider the following:

- Increase the number of bits that are retrieved at any one time by making DRAMs wider
rather than deeper and by using wide data paths.
- Change the DRAM interface to make it more efficient by including a cache or other
buffering scheme on the DRAM chip.
- Reduce the frequency of memory access by incorporating increasingly complex and efficient
cache structures between the processor and main memory. This includes the incorporation
of one or more caches on the processor chip as well as an off-chip cache close to the
processor chip.
- Increase the interconnect bandwidth between processors and memory by using higher-speed
buses and a hierarchy of buses to buffer and structure data flow.
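The payoff of the cache-based approaches can be quantified with the standard average-memory-access-time formula, AMAT = hit time + miss rate x miss penalty. The figures below are illustrative only, not measurements of any real system.

```python
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time for a single cache level."""
    return hit_time_ns + miss_rate * miss_penalty_ns

no_cache = 100.0                      # every access goes to main memory
with_cache = amat(2.0, 0.05, 100.0)   # 95% of accesses hit a 2 ns cache
print(with_cache)                     # 7.0 -> average access drops from 100 ns to 7 ns
```

Even a modest hit rate transforms the effective memory speed, which is why contemporary designs spend so much silicon on cache hierarchies.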

2.9 The Evolution Of The Intel x86 Architecture


Throughout this text, we will rely on many examples of computer design and implementation to
illustrate concepts and illuminate trade-offs where they exist. More often than not, the text will rely
on examples from two computer families: the Intel x86 and the ARM architecture. The current x86
offerings represent the results of decades of design effort on complex instruction set computers
[CISCs]. The x86 incorporates the complex and sophisticated design principles once found only on
mainframes and supercomputers, and serves as an excellent example of CISC design. An alternative
approach to processor design is the reduced instruction set computer [RISC]. The ARM architecture
is used in a variety of embedded systems and is one of the most powerful and best-designed
RISC-based systems on the market.
In terms of market share, Intel has consistently ranked as the number one maker of microprocessors
for non-embedded systems for decades, a position it is unlikely to yield any time soon. Here is a list
of some of the highlights of the evolution of the Intel product line:

- 8080: The world's first general-purpose microprocessor. This was an 8-bit machine with an
8-bit data path to memory. The 8080 was used in the first personal computer, the Altair.
- 8086: A far more powerful 16-bit machine. In addition to a wider data path and larger
registers, the 8086 sported an instruction cache, or queue, that pre-fetched a few instructions
before they were executed. A variant of this processor, the 8088, was used in IBM's first
personal computer, securing Intel's success as a microprocessor manufacturer. The 8086 marks
the first appearance of the x86 architecture.
- 80286: This extension of the 8086 enabled the addressing of 16 MB of memory instead of
just 1 MB.

[Note: A cache is a relatively small but fast memory interposed between a larger but slower memory
and the system logic that accesses the larger memory. The cache holds recently accessed data and is
designed to speed up subsequent access to that data.]

- 80386: Intel's first 32-bit machine and a major overhaul of the Intel product line. With its
32-bit architecture, the 80386 rivalled the complexity and power of minicomputers and
mainframes introduced just a few years earlier. It was the first Intel processor to support
multitasking, meaning that it could run multiple programs at the same time.
- 80486: This microprocessor introduced much more sophisticated and powerful cache
technology as well as instruction pipelining. A pipeline works in much the same way as
an assembly line in a manufacturing plant, enabling different stages of execution of different
instructions to occur at the same time along the pipeline. The 80486 also included a built-in
math co-processor, relieving the CPU of complex math operations.
- Pentium: With the Pentium, Intel introduced superscalar techniques, which allow multiple
instructions to execute in parallel. Superscalar techniques provide multiple pipelines within a
single processor so that instructions that do not depend on one another can be executed in
parallel.
- Pentium Pro: This processor continued the move into superscalar organization begun with
the Pentium, with aggressive use of register renaming, branch prediction, data flow analysis
and speculative execution.
- Pentium II: The Pentium II incorporated Intel MMX technology, which is designed specifically
to process video, audio and graphics data efficiently.
- Pentium III: This processor incorporates additional floating-point instructions to support 3D
graphics software.
- Pentium 4: The Pentium 4 includes additional floating-point instructions and other
enhancements for multimedia.
- Core: The first Intel x86 microprocessor with a dual core, referring to the implementation of
two processors on a single chip.
- Core 2: The Core 2 extends the architecture to 64 bits. The Core 2 Quad provides four
processors on a single chip.
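The assembly-line behaviour of pipelining described for the 80486 can be quantified with the usual idealized cycle count: with a k-stage pipeline, n instructions finish in roughly k + (n - 1) cycles instead of n * k, since a new instruction enters the pipeline every cycle once it is full. The numbers below are illustrative and ignore stalls and dependencies.

```python
def pipelined_cycles(n, k):
    """Idealized cycle count for n instructions on a k-stage pipeline."""
    return k + (n - 1)

n, k = 1000, 5
sequential = n * k                       # 5000 cycles without pipelining
pipelined = pipelined_cycles(n, k)       # 1004 cycles with a full pipeline
print(round(sequential / pipelined, 2))  # 4.98 -> speedup approaching k
```

Real pipelines fall short of this bound because of branches and data dependencies, which is precisely what the branch-prediction and data-flow-analysis techniques described earlier try to recover.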

Over 30 years after its introduction in 1978, the x86 architecture continues to dominate the
processor architecture outside of embedded systems. Although the organization and technology of
the x86 has changed dramatically over the decades, the instruction set architecture has evolved to
remain backward compatible with earlier versions. All changes to the instruction set architecture
have involved additions to the instruction set, with no subtractions. Instructions have been added at
a rate of roughly one per month over the last 30 years, so that there are now over 500 instructions
in the instruction set.

Reading/Activity: Study Table 2.4 on page 51 of the prescribed text.


Activity 1: The relative performance of the IBM 360 Model 75 is 50 times that of the 360 Model 30,
yet the instruction cycle is only 5 times as fast. Account for the discrepancy.
Activity 2: In the IBM 360 Models 65 and 75 addressing is staggered in two separate main memory
units (All even-numbered words in one unit and all odd-numbered words in the other). What might
be the purpose for this technique?

2.10 Embedded Systems And The ARM


The ARM architecture refers to a microprocessor architecture that has evolved from RISC design
principles and is used in embedded systems.
The term embedded system refers to the use of electronics and software within a product to enable
the product to perform a dedicated function. In many cases an embedded system is itself part of a
broader system, e.g. the clock and/or timing facility in your microwave oven.
Embedded systems far outnumber general-purpose computer systems encompassing a broad range
of applications. Often, embedded systems are tightly coupled to their environment. This can give rise
to real-time constraints imposed by the need to interact with that environment. Constraints such as
required speeds of motion, required precision of measurement and required durations dictate the
timing of software operations.
Table 2.3 Examples Of Embedded Systems And Their Markets

Market                Embedded devices
Automotive            Ignition system; engine control; brake system
Consumer Electronics  Digital and analog televisions; set-top boxes [DVDs, VCRs, cable boxes];
                      personal digital assistants; kitchen appliances [fridges, toasters,
                      microwave ovens]; automobiles; toys/games; telephones/cell phones/pagers;
                      cameras; global positioning systems
Industrial Control    Robotics and control systems for manufacturing; sensors
Medical               Infusion pumps; dialysis machines; prosthetic devices; cardiac monitors
Office Automation     Fax machines; photocopiers; printers; monitors; scanners
2.11 ARM Evolution
ARM is a family of microprocessors and microcontrollers designed by ARM Inc. of Cambridge,
England. The company does not make processors itself but designs microprocessor and multicore
architectures and licenses them to manufacturers. ARM chips are high-speed processors known for
their small die size and very low power requirements. They are widely used in PDAs and other
handheld devices, including games and phones, as well as a large variety of consumer products. ARM
chips are the processors in Apple's highly popular iPod and iPhone handheld devices. ARM is the
most widely used embedded processor architecture and, indeed, the most widely used processor
architecture of any kind in the world.

The origins of ARM can be traced back to the British company Acorn Computers. In the early
1980s, Acorn was awarded a contract by the BBC to develop a new microcomputer architecture for
the BBC Computer Literacy Project. The success of this contract enabled Acorn to go on to develop
the first commercial RISC processor, the Acorn RISC Machine. The first version became operational
in 1985 and was used for internal research and development as well as serving as a coprocessor in
the BBC machine. Also in 1985, Acorn released the ARM2, which offered greater functionality and
speed within the same physical space. Further improvements were achieved with the release of the
ARM3 in 1989. Throughout this period, Acorn used the company VLSI Technology to do the actual
fabrication of the processor chips.

Self-Assessment Exercise
2.0 List the six basic types of register and state:
2.0.1 Their Function
2.0.2 Where they are found within the CPU?
2.1 Explain clearly what is meant by the stored program concept
2.2 Identify the first two major breakthroughs that ushered in the technological revolution
characterizing the period between the first and third generations of computers.
2.3 Give a definition of what a solid state device is while citing an example and list the
advantages that solid state devices introduced into the Computer Architecture domain.
2.4 Explain what is meant by data channel and state the biggest advantage(s) that data
channels bring to Computer Architecture and design.
2.5 What is meant by ferromagnetic material and what is/was this material meant for?
2.6 How was ferromagnetic material used in the implementation of technology up until the
third generation of computers?
2.7 Define:
2.7.1 Pipelining
2.7.2 Superscalar processing
2.8 Clearly distinguish between CISC and RISC, giving a simple example that makes use of a
simplified high-level language instruction like ADD, SUBTRACT, MULTIPLY, etc. Your
illustration must show how either implementation treats the instruction at processor level.
Assume that all the variables are in memory.

Problem
2.0 The ENIAC was a decimal machine, where a register was represented by a ring of vacuum
tubes. At any time, only one vacuum tube was in an ON state, representing one of the 10
digits. Assuming that the ENIAC had the capability to have multiple vacuum tubes in the ON
and OFF state simultaneously, why is this representation wasteful and what range of
integer values could we represent using the 10 vacuum tubes?


3. Instruction Set Architecture


Section Outcomes
At the end of this section, the learner should be able to:

Relate to, understand and explain the main instruction addressing modes and formats
Obtain a deeper grasp of computer performance and performance metrics
Answer questions relating to the Control Unit and its operations
Demonstrate a basic understanding of the micro-operations that the CU goes through in
order to influence the completion of one computer instruction
Demonstrate by ably and aptly answering questions relating to the performance and
performance metrics of the Control Unit

3.1 Addressing Modes


An instruction consists of an opcode, usually with additional information relating to where the
operands are to be found and where results are to be stored. The general subject of specifying
where operands are is called addressing. In this section, we examine the most common types of
addressing techniques which in essence are:
Immediate Addressing
Direct Addressing
Indirect Addressing
Register Addressing
Register Indirect Addressing
Displacement Addressing, and
Stack Addressing.
3.1.1 Immediate Addressing
This is the simplest form of addressing in which the operand value is within the instruction itself. This
form of addressing is used to define and use constants as well as setting initial values of variables.
The obvious advantage of immediate addressing is that no further memory reference other than the
instruction fetch is required to obtain the operand, thus saving one memory or cache cycle in the
instruction cycle. The disadvantage is that the size of the number is restricted to the size of the
address field which in most instruction sets is small compared to the word size.
3.1.2 Direct Addressing
In this form of addressing, the address field contains the effective address of the operand. This
technique was common in earlier generations of computers, but has faded in contemporary
generations.

Think Point: Consider the advantages and disadvantages of this addressing technique and
list them below.
Advantage:
Disadvantage:


3.1.3 Indirect Addressing


With direct addressing the length of the address field is usually less than the word length, thus
limiting the address range. One solution is to have the address field in the instruction refer to the
address of a word in memory, which in turn contains a full-word address of the operand.

Think Point: Demonstrate your understanding of Indirect addressing by listing the


advantages and disadvantages of this addressing technique below.
Advantage:
Disadvantage:

3.1.4 Register Addressing


Register addressing is similar to direct addressing, the only difference being that the address field refers to a register rather than to a memory location. Typically, an address field that references registers has between 3 and 5 bits, so that a total of 8 to 32 general-purpose registers can be referenced. The advantages of register addressing are that (1) only a small address field is needed in the instruction, and (2) no time-consuming memory references are required. The disadvantage of register addressing is that the address space is very limited.
3.1.5 Register Indirect Addressing
Just as register addressing is analogous to direct addressing, register indirect addressing is analogous to indirect addressing. The address field in register indirect addressing refers to a register, and that register contains the memory address of the operand. The advantages and limitations of register indirect addressing are basically the same as those of indirect addressing; in addition, register indirect addressing requires only one memory reference, one fewer than indirect addressing.
3.1.6 Displacement Addressing
A very powerful mode of addressing combines the capabilities of direct addressing and register indirect addressing. Displacement addressing requires that the instruction have two address fields, at least one of which is explicit. The value contained in one address field is used directly. The other address field, or an implicit reference based on the opcode, refers to a register whose contents are added to the contents of the explicit field to produce the effective address.
3.1.7 Stack Addressing
The last addressing mode that we are going to look at is the Stack addressing mode. A stack is a
linear array of locations in memory. It is also referred to as pushdown list or a last-in-first-out queue.
It is a reserved block of locations. Items are appended to the top of the stack so that at any given
time the block is partially filled. Associated with the stack is a pointer whose value is the address of
the top of the stack. Alternatively, the top two elements of the stack may be in processor registers,
in which case the stack pointer references the third element of the stack. The stack pointer is
maintained in a register, thus references to stack locations in memory are in fact register indirect
addresses.
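The modes above differ only in how the operand is located. The toy Python sketch below makes the differences concrete; the memory contents, register names and the `operand` helper are all invented for illustration and do not correspond to any real instruction set.

```python
# Toy machine state for illustrating addressing modes (hypothetical values).
memory = {100: 7, 200: 100, 300: 55}   # address -> contents
registers = {"R1": 300, "R2": 42}      # register -> contents

def operand(mode, field):
    """Return the operand value for a given addressing mode and address field."""
    if mode == "immediate":          # the field IS the operand
        return field
    if mode == "direct":             # field is the operand's address
        return memory[field]
    if mode == "indirect":           # field is the address of the address
        return memory[memory[field]]
    if mode == "register":           # field names a register holding the operand
        return registers[field]
    if mode == "register_indirect":  # register holds the operand's address
        return memory[registers[field]]
    if mode == "displacement":       # base register contents + explicit displacement
        base, disp = field
        return memory[registers[base] + disp]
    raise ValueError(mode)

print(operand("immediate", 7))              # 7: no memory reference at all
print(operand("direct", 100))               # 7: one memory reference
print(operand("indirect", 200))             # 7: two memory references
print(operand("register", "R2"))            # 42: no memory reference
print(operand("register_indirect", "R1"))   # 55: one memory reference
print(operand("displacement", ("R2", 258))) # 55: R2 holds 42, 42 + 258 = 300
```

Note how the comments track the memory-reference cost discussed in the text: immediate and register addressing touch memory zero times, direct and register indirect once, indirect twice.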


Reading/Activity: Having come thus far in this unit, please read pages 418-431 in the
prescribed text for this module and answer the questions under the following activity:
Activity
Briefly define:
Immediate Addressing
Direct Addressing
Indirect Addressing
Register Addressing
Register Indirect Addressing
Stack Addressing

3.2 Addressing Formats


An instruction format defines the layout of the bits of an instruction in terms of its constituent fields.
An instruction format must include an opcode, and explicitly or implicitly, zero or more operands.
Each explicit operand is referenced using one of the addressing modes already dealt with. The
instruction format must, implicitly or explicitly, indicate the addressing mode for each operand.
3.2.1 Instruction Length
The most basic design issue to be faced is the instruction format length. This decision affects, and is
affected by memory size, memory organization, bus structure, processor complexity and processor
speed. The instruction length decision determines the richness and flexibility of the machine as seen
by the assembly language programmer. Programmers generally want more opcodes, more
operands, more addressing modes and greater addressing range. More opcodes and more operands
make it possible for programmers to write shorter programs and more addressing modes give the
programmer greater flexibility when implementing certain functions such as table manipulations and
multiple-way branching. All of these programmer conveniences require bits, and more bits push in the
direction of longer instruction lengths. Longer instruction lengths may, however, be wasteful: a 64-bit
instruction occupies twice the space of a 32-bit instruction but is probably less than twice as
useful.
3.2.2 Allocation Of Bits
For a given instruction length, there is a trade-off between the number of opcodes and the power of
the addressing capability. More opcodes obviously mean more bits in the opcode field. For an
instruction of given length, this reduces the number of bits available for addressing. The following
inter-related factors go into the determination of the addressing bits:
Number Of Addressing Modes: Sometimes an addressing mode can be indicated implicitly.
For example certain opcodes might always call for indexing. In other cases, the addressing
modes must be explicit and one or more mode bits will be needed.
Number Of Operands: We have seen that fewer addresses make for longer, more awkward
programs. Typical instructions on today's machines provide for two operands. Each operand
address in the instruction might require its own mode indicator, or the use of a mode
indicator could be limited to just one of the address fields.


Register versus Memory: A machine must have registers so that data can be brought into the
processor for processing. With a single user-visible register (called an accumulator), one
operand address is implicit and consumes no instruction bits. However single register
programming is awkward and requires many instructions. The more that registers can be
used for operand references, the fewer bits are needed. Most studies indicate that a total of
between 8 and 32 registers is desirable.
Number Of Register Sets: Most contemporary machines have one set of general-purpose
registers, with typically 32 or more registers in the set. These registers can be used to store
data or to hold addresses for displacement addressing.
Address Range: For addresses that reference memory, the number of addresses that can be
referenced is related to the number of address bits. Because this imposes severe limitations,
direct addressing is seldom used. With displacement addressing, the range is opened up to
the length of the address register. Even so, it is still convenient to allow for large
displacements from the register address, which requires a relatively large number of address
bits in the instruction.

Reading/Activity: Please read pages 431-444 in the prescribed text for this module and
answer the questions under the following activity:
Activity
Briefly define:
What facts go into determining the use of the addressing bits of an instruction?
What are the advantages and disadvantages of using a variable-length instruction format?
What is the advantage of auto-indexing?
What is the difference between post-indexing and pre-indexing?

3.3 Computer System Performance And Performance Metrics


In evaluating processor hardware and setting requirements for new systems, performance is one of
the key parameters to consider along with cost, size, security, reliability and in many aspects, power
consumption. It is difficult to make meaningful comparisons among processors of the same family,
let alone among different processors. Raw speed is far less important than how a processor
performs when executing a given application. Application performance depends not only on the
processor's raw speed but also on the instruction set, the choice of implementation language, the
efficiency of the compiler and the level of programming skill in the implementation of the application.
3.3.1 The System Clock
All operations performed by the processor such as fetching an instruction, decoding the instruction,
performing an arithmetic operation, etc., are governed by a system clock. All operations begin with
the pulse of the clock and the speed of the processor is dictated by the pulse frequency produced by
the clock and this pulse frequency is measured in cycles per second or Hertz [Hz]. Typically clock
signals are generated by a quartz crystal which generates a constant wave signal while power is
applied. This constant wave signal is then converted into a constant digital voltage pulse which is
supplied in a constant flow to the processor circuitry. A 1GHz processor receives 1 billion pulses per
second and this rate is known as the clock rate or clock speed. One incremental pulse of the clock is
referred to as a clock cycle or a clock tick. The time between pulses or clock ticks or clock cycles is
the cycle time.
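The relationship between clock rate and cycle time is simply a reciprocal, which a two-line check makes concrete (the 2.5 GHz figure below is purely illustrative):

```python
clock_rate = 2.5e9           # a 2.5 GHz processor (illustrative figure)
cycle_time = 1 / clock_rate  # cycle time is the reciprocal of the clock rate
print(cycle_time * 1e9)      # cycle time in nanoseconds: 0.4 ns per clock tick
```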


Instruction Execution Rate


A processor is driven by a clock with constant frequency f or, equivalently, a constant cycle time τ,
where τ = 1/f.
The instruction count Ic for a program is the number of machine instructions executed for that
program until it runs to completion. Note that this is the number of instruction executions, not the
number of instructions in the object code of the program.
An important parameter is the average number of cycles per instruction, CPI, for a program. If all
instructions required the same number of clock cycles, CPI would be a constant value for a
processor. However, on any given processor the number of clock cycles required varies for different
types of instructions, such as load, store and branch. Let CPIi be the number of cycles required for
instruction type i, and Ii the number of executed instructions of type i for a given program. The
overall CPI can then be calculated as follows:

CPI = (Σi CPIi × Ii) / Ic .......... [i]

The processor time T needed to execute a given program can be expressed as


T = Ic × CPI × τ
We must however remain cognizant of the fact that during the execution of an instruction, part of
the work is done by the processor and the other part involves the transfer of part of the instruction
word to and from memory. In this latter case the time to transfer depends on the memory cycle
time, which may be greater than the processor cycle time. We can therefore refine the expression for T as
T = Ic × [p + (m × k)] × τ .......... [ii]
Where:
p is the number of processor cycles required to decode and execute the instruction and is
therefore equal to CPI in equation [i].
m is the number of memory references needed
k is the ratio between memory cycle time and processor cycle time.
The five performance factors in the preceding equation [Ic, p, m, k, τ] are influenced by four system
attributes:
The design of the instruction set, also known as the instruction set architecture
Compiler technology which entails how effective the compiler is in producing an efficient
machine language program from a high level language program
Processor implementation, and
Cache memory hierarchy.
A common measure of processor performance is the rate at which instructions are executed,
measured in millions of instructions per second [MIPS] referred to as the MIPS rate. The MIPS rate is
expressed in terms of the clock rate and CPI as follows:


MIPS rate = Ic / (T × 10⁶) = f / (CPI × 10⁶)
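The equations above can be exercised directly in a few lines. The instruction mix and clock rate below are invented for demonstration only; the point is how Ic, CPI, T and the MIPS rate fall out of the definitions:

```python
# Hypothetical instruction mix: (type, count in millions, cycles per instruction).
mix = [
    ("arithmetic/logic", 6, 1),
    ("load/store",       3, 3),
    ("branch",           1, 4),
]
f = 400e6                                    # illustrative clock rate: 400 MHz

Ic = sum(count for _, count, _ in mix)       # total instruction count (millions)
cycles = sum(count * cpi for _, count, cpi in mix)
CPI = cycles / Ic                            # effective cycles per instruction
T = Ic * 1e6 * CPI / f                       # execution time: T = Ic * CPI * tau
mips = f / (CPI * 1e6)                       # MIPS rate = f / (CPI * 10^6)

print(CPI)    # (6*1 + 3*3 + 1*4) / 10 = 1.9
print(T)      # about 0.0475 seconds
print(mips)   # about 210.5 MIPS
```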

3.3.2 Benchmarks
Measures such as the MIPS have proved inadequate in the evaluation of the performance of
processors. Because of differences in instruction sets, the instruction execution rate is not a valid
means of comparing the performance of different architectures. Consider the high-level language
code:
A = B + C /* assume all quantities are in memory */
On a CISC machine, this instruction can be compiled into one processor instruction:
Add mem(B), mem(C), mem(A)
On a typical RISC machine, the compilation will look something like the following:
Load mem(B), reg(1);
Load mem(C), reg(2);
Add reg(1), reg(2), reg(3);
Store reg(3), mem(A);
Because each RISC instruction is simpler and takes fewer clock cycles, both machines may execute
the original high-level language instruction in about the same time. This would mean that the CISC
machine is rated at 1 MIPS while the RISC machine is rated at 4 MIPS, though the two machines do
the same amount of work in the same amount of time.
The performance of a processor on a given program may not be a useful indication on how the same
processor will perform on a different application. Therefore beginning in the 1980s and early 1990s,
industry and academic interest shifted towards measuring the performance of systems using a set of
benchmark programs. The same set of programs can be run on different machines and the execution
times compared. A desirable benchmark program must have the following characteristics:

It is written in a high-level language, making it portable across different machines


It is representative of a particular programming style, such as systems programming,
numerical programming or commercial programming
It can be measured easily
It has wide distribution.


3.4 Control Unit Operations


The basic functional elements of the processor as we already know are:
The ALU
Registers
Internal data paths
External data paths, and
The Control Unit

The ALU is the functional essence of the computer. Registers are used to hold data internal to the
processor and some registers contain status information needed to manage instruction sequencing.
Others contain data that go to or come from the ALU, memory and I/O modules. Internal data paths
are used for inter-register data movement and between registers and the ALU. External data paths
link registers to memory and I/O modules, often by means of a system bus. The CU therefore causes
operations to happen within the processor.
The execution of a program consists of operations involving these processor elements which are
altogether controlled and galvanized by the CU. These operations consist of sequences of micro-operations, and all micro-operations fall into the following categories:
The transfer of data from one register to another.
The transfer of data from a register to an external interface.
The reverse route: The transfer of data from an external interface to a register.
Performing an arithmetic or logic operation, using registers for input and output.
All the micro-operations needed to perform one instruction cycle, including all the micro-operations
to execute every instruction in the instruction set, fall into one of these categories.
In a nutshell, the Control Unit performs two basic tasks:
Sequencing: The CU causes the processor to step through a series of micro-operations in the
proper sequence based on the program being processed.
Execution: The CU causes each micro-operation to be performed.
3.4.1 Control Signals
The key to how the CU operates is its use of signals. For the control unit to effectively perform its
function, it must have inputs that allow it to determine the state of the system and outputs that
allow it to control the behavior of the system. These are the external specifications of the CU.
Internally, the CU must have the logic required to perform its sequencing and execution functions.
Figure 3.1 is a general model of the CU showing its inputs and outputs. The inputs are as follows:
Clock: This is how the CU keeps time. The CU causes one micro-operation, or a series of
micro-operations to be executed or performed for each clock pulse. This is sometimes
referred to as the processor cycle time or the clock cycle time.
Instruction Register: The opcode of the current instruction is used to determine which
micro-operations to perform during the execute cycle.
Flags: These are needed by the CU to determine the status of the processor and the
outcome of previous ALU operations.
Control Signals From Control Bus: The control bus portion of the system bus provides signals
to the CU, such as interrupt signals and acknowledgements.


The Outputs are as follows:


Control Signals Within The Processor: There are two types: those that cause data to be
moved from one register to another, and those that activate specific ALU functions.
Control Signals To Control Bus: Control signals to memory and to I/O modules.

Figure 3.1 Model Of The Control Unit Showing All Its Inputs And Outputs
There are three types of control signals that are used by the Control Unit:
There are those that activate an ALU function
Those that activate a data path, and
Those that are signals on the external system bus or other external interface. All these
signals are applied directly as binary inputs to individual logic gates.
Let us consider a fetch cycle to see how the control unit maintains its control over the function of
the system. The CU keeps track of where it is in the instruction cycle. At any given point, it knows
that the fetch cycle is to be performed next. The first step is to transfer the contents of the PC to the
MAR. The CU does this by activating a control signal that opens the gates between the
bits of the PC and the bits of the MAR. The next step is to read a word from memory into the
MBR and increment the PC. The CU does this by sending the following control signals simultaneously:
A control signal that opens gates, allowing the contents of the MAR onto the address bus
A memory-read control signal onto control bus
A control signal that opens the gates, allowing the contents of the data bus to be stored in
the MBR
Control signals to logic that adds 1 to the contents of the PC and stores the results back to
the PC.
Following this, the CU sends a control signal that opens gates between the MBR and the IR. This
completes the fetch cycle except for one thing: the CU must decide whether to perform an indirect
cycle or an execute cycle next. To decide this, it examines the IR to see if an indirect memory
reference is made.


3.5 The Fetch Cycle


The steps involved in the fetch cycle have already been dealt with at the beginning of this section,
but for reading convenience, we will repeat them here.

At the beginning of the fetch cycle, the address of the next instruction to be executed is in the PC.
The first step is to move that address to the MAR, because this is the only register connected
to the address lines of the system bus.
The second step is to bring in the instruction. The desired address in the MAR is placed on
the address bus, and the instruction appears on the data bus and is copied into the MBR.
Meanwhile, the PC needs to be incremented so that it points to the address of the next
instruction, the one that comes after the current one. These two actions, reading the
instruction into the MBR and incrementing the PC, are independent, so the system can do
them simultaneously.
The third step is to move the contents of the MBR to the IR. This frees up the MBR for use
during a possible indirect cycle.

This fetch cycle therefore consists of three steps and four micro-operations. Each micro-operation
involves the movement of data into or out of a register. Writing (X) for the contents of register or
location X, these micro-operations can be symbolically represented as follows:

t1: MAR ← (PC)
t2: MBR ← Memory[MAR]
    PC ← (PC) + L    [where L is the length of the instruction]
t3: IR ← (MBR)

3.6 The Indirect Cycle


Once an instruction is fetched, the next step is to fetch source operands. Let us assume a one-address
instruction format, with direct and indirect addressing allowed. If the instruction specifies
an indirect address, then an indirect cycle must precede the execute cycle. The indirect cycle
includes the following micro-operations:

t1: MAR ← (IR(Address))
t2: MBR ← Memory[MAR]
t3: IR(Address) ← (MBR(Address))

In this case, the address field of the instruction is transferred to the MAR. This is then used to fetch
the address of the operand. Finally, the address field of the IR is updated from the MBR, so that the
IR now contains a direct rather than an indirect address. The IR is now in the same state as if indirect
addressing had not been used, and it is ready for the execute cycle to begin.
3.7 The Execute Cycle
The sequence of events involved in the Fetch and Indirect Cycles look quite obvious and to a certain
extent predictable because the steps involved and the register usage exhibit considerable repetition.
Not so with the Execute Cycle. For a machine with N different opcodes, there are N different
sequences of micro-operations that occur. Let us consider an example:


First let us consider an ADD instruction:

ADD R1, X

which adds the contents of location X to register R1. The following sequence of micro-operations
might occur:

t1: MAR ← (IR(Address))
t2: MBR ← Memory[MAR]
t3: R1 ← (R1) + (MBR)

In the first step, the address portion of the IR is loaded into the MAR. Then the referenced memory
location is read into the MBR, and finally the contents of R1 and the MBR are added by the ALU.
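The fetch and execute sequences described above can be mimicked in a short Python sketch. This is a toy model invented purely for illustration: the dictionary-based memory, the tuple instruction encoding and an instruction length of 1 are all assumptions, not a description of any real processor.

```python
# Toy machine state: a small memory, plus the registers named in the text.
memory = {0: ("ADD", "R1", 5),   # instruction at address 0: ADD R1, X with X = 5
          5: 20}                 # memory location X holds the operand 20
regs = {"PC": 0, "MAR": None, "MBR": None, "IR": None, "R1": 3}

# Fetch cycle:
# t1: MAR <- (PC); t2: MBR <- Memory[MAR], PC <- (PC)+L; t3: IR <- (MBR)
regs["MAR"] = regs["PC"]
regs["MBR"] = memory[regs["MAR"]]
regs["PC"] += 1                  # instruction length L taken as 1 here
regs["IR"] = regs["MBR"]

# Execute cycle for ADD R1, X:
# t1: MAR <- (IR(Address)); t2: MBR <- Memory[MAR]; t3: R1 <- (R1) + (MBR)
opcode, reg, addr = regs["IR"]
regs["MAR"] = addr
regs["MBR"] = memory[regs["MAR"]]
regs[reg] = regs[reg] + regs["MBR"]

print(regs["R1"])   # 3 + 20 = 23
print(regs["PC"])   # 1: the PC now points at the next instruction
```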
We have come to the end of this unit and all that is left to do is to look at the Activity below and
complete it. Having read all the material above, you should complete this reading activity within 90
minutes.


Reading/Activity: Please read pages 579-602 in the prescribed text for this module and
answer the questions under the following activity:

Self-Assessment
1 Explain the difference between the written sequence and the time sequence of an instruction.
2 What is the relationship between instructions and micro-operations?
3 What is the overall function of a processor's CU?
4 Outline a 3-step process that leads to a characterization of the CU?
5 What basic tasks does the CU perform?
6 Provide a typical list of inputs and outputs of the CU
7 List three Types of control signals

Problems
3.0 Consider two different machines with two different instruction sets, both of which have a clock
rate of 200 MHz. The following measurements are recorded on the two machines running a
given set of benchmark programs:

Machine   Instruction Type       Instruction Count   CPI
A         Arithmetic & Logic     8                   1
          Load & Store           4                   3
          Branch                 2                   4
          Other                  4                   3
B         Arithmetic & Logic     10                  1
          Load & Store           8                   2
          Branch                 2                   4
          Other                  4                   3

Determine the effective CPI, MIPS and execution time for each machine. Comment on the results.
3.1 While browsing at Billy Bob's computer store, you overhear a customer asking Billy Bob what
the fastest laptop computer in the store that he can buy is. Billy Bob replies, "You are looking at
our Acers. The fastest Acer we have has a Core 2 Quad processor that runs at a speed of 2.0 GHz.
If you really want the fastest machine, you should buy our 2.4 GHz Core 2 Duo Mac." Is Billy Bob
correct? What would you say to help this customer?


4. Computer Arithmetic And Number Systems


Section Outcomes
At the end of this section, the learner should be able to:

Relate to, understand and explain the main numbering systems of which the focus of this
unit is the binary numbering system
Understand and explain why the Binary numbering system is so cardinal to computing
Perform basic Addition, Subtraction, Multiplication and Division operations in Binary
Convert given numbers from Binary to Decimal and the reverse
Understand basic Boolean Algebra and Logic gates and the representations thereof
Understand the functions and importance of Assemblers and Compilers

The number system that we are all familiar with is the decimal number system, which we use every
day; this system is anchored on, or expressed to, base 10. When we write an expression like
24 + 13 or 28 − 7, the numerals or collective digits involved are in decimal, i.e., expressed to
base 10. There are other representations that can be used, the most common examples of which
are:
Octal: In which numbers are expressed to base 8, and
Hexadecimal: In which the numbers are expressed to base 16.
However, in the field of computing the most important numbering system is the binary system,
which represents numbers to base 2. Because this is the cardinal numbering system in
the world of computing, it is the one on which we will concentrate and focus our energies in this
unit.
4.1 Why Binary?
Computers are built from transistors, and an individual transistor can only exhibit one of two states
at any given time - be ON or OFF [Two Options]. Similarly, data storage devices can be optical or
magnetic. Optical storage devices store data in a specific location by controlling whether light is
reflected off that location or is not reflected off that location [Two Options]. Likewise, magnetic
storage devices store data in a specific location by magnetizing the particles in that location with a
specific orientation. We can have the north magnetic pole pointing in one direction, or the opposite
direction [Two Options].
Computers therefore can most readily use two symbols, and therefore a base-2 system, or a binary
number system, is most appropriate. The base-10 number system [Decimal] has 10 distinct symbols:
0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. The base-2 system has exactly two symbols: 0 and 1. The base-10
symbols are termed digits. The base-2 symbols are termed binary digits, or bits for short. All base-10
numbers are built as strings of digits (such as 6349). All binary numbers are built as strings of bits
(such as 1101). Just as we would say that the decimal number 12890 has five digits, we would say
that the binary number 11001 is a five-bit number. The point: All data in a computer is represented
in binary and is read by the computer in bits.
4.2 Converting A Binary Number To Decimal
To convert a binary number to a decimal number, we simply write the binary number as a sum of
powers of 2. For example, to convert the binary number 1011 to a decimal number, we note that the
rightmost position is the ones position and the bit value in this position is a 1, so this rightmost bit
has the decimal value 1×2⁰. The next position to the left is the twos position, and the bit value in
this position is also a 1, so this next bit has the decimal value 1×2¹. The next position to the left is
the fours position, and the bit value in this position is a 0. The leftmost position is the eights position,
and the bit value in this position is a 1, so this leftmost bit has the decimal value 1×2³. Thus:
1011 = (1×2³) + (0×2²) + (1×2¹) + (1×2⁰) = 8 + 0 + 2 + 1 = 11, i.e., 1011₂ = 11₁₀.

Let us now express the binary number 110110 as a decimal number:
110110 = (1×2⁵) + (1×2⁴) + (0×2³) + (1×2²) + (1×2¹) + (0×2⁰) = 32 + 16 + 0 + 4 + 2 + 0 = 54, so 110110₂ = 54₁₀.

As a shorthand means of converting a binary number to a decimal number, simply write the position
value below each bit (i.e., write a 1 below the rightmost bit, then a 2 below the next bit to the
left, then a 4 below the next bit to the left, etc.), and then add the position values for those bits
that have a value of 1. The bits that have a value of 0 are ignored.
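The positional method just described translates directly into a short Python function; Python's built-in int with an explicit base gives an independent cross-check (the function name here is our own):

```python
def binary_to_decimal(bits):
    """Sum the position values (1, 2, 4, 8, ...) of the bits that are 1."""
    total = 0
    for position, bit in enumerate(reversed(bits)):  # rightmost bit is 2**0
        if bit == "1":
            total += 2 ** position
    return total

print(binary_to_decimal("1011"))    # 8 + 2 + 1 = 11
print(binary_to_decimal("110110"))  # 32 + 16 + 4 + 2 = 54
print(int("1011", 2))               # built-in cross-check: also 11
```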

Think Point: Look closely at the conversion demonstration below and following through the
steps shown, then make an attempt to convert the binary number given to decimal.
To convert the binary number 10101 to decimal, we annotate the position values below the bit
values:
1 0 1 0 1
16 8 4 2 1
Then we add the position values for those positions that have a bit value of 1: 16 + 4 + 1 = 21. Thus
10101₂ = 21₁₀.
Convert The Binary number 1100101101101 to decimal.
Answer:

4.3 Converting A Decimal Number To Binary


There are two basic methods for converting from decimal to binary.
4.3.1 Method 1
Take for example the decimal number 90.
The first step is to ask yourself: what is the largest power of 2 that is equal to or less than 90?
In this case we find it is 64, which is 2 to the power of 6, i.e., 2⁶. We place a 1 in this position to
symbolically stand for (1×2⁶).
In the second step, we subtract 2⁶ from 90 and get 26. What is the largest power of 2 that is
equal to or less than 26? The answer is 2⁴, which is 16. We denote this position as (1×2⁴).
In the third step, we subtract 2⁴ from 26 and get 10. What is the largest power of 2 that is equal
to or less than 10? The answer is 8, which is 2³. We denote this position as (1×2³).


In the fourth step, we subtract 2³ from 10 and get 2. What is the largest power of 2 that is
equal to or less than 2? The answer is 2, which is 2¹. We denote this position as (1×2¹).
In the fifth step we subtract 2¹ from 2 and the answer is 0. At this juncture, the algorithm ends. Now
we look at the positions we have denoted above:
(1×2⁶), (1×2⁴), (1×2³) and (1×2¹). This means that from position 2⁶ down to position 2⁰, only these
positions hold a 1; the rest hold a 0 each. Now we list the binary
number obtained:
Position:  2^6  2^5  2^4  2^3  2^2  2^1  2^0
Bit:         1    0    1    1    0    1    0
Value:      64    0   16    8    0    2    0

The values in the bottom row sum to 90, so 90₁₀ = 1011010₂.
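Method 1 can be sketched in Python as follows (an illustrative sketch rather than prescribed material):

```python
def decimal_to_binary_powers(n: int) -> str:
    """Method 1: repeatedly subtract the largest power of 2 that fits."""
    if n == 0:
        return "0"
    # Find the largest power of 2 less than or equal to n.
    power = 1
    while power * 2 <= n:
        power *= 2
    bits = ""
    while power >= 1:
        if power <= n:      # this power of 2 fits, so this position gets a 1
            bits += "1"
            n -= power
        else:               # this power of 2 does not fit, so the position is 0
            bits += "0"
        power //= 2
    return bits

print(decimal_to_binary_powers(90))  # 1011010
```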

4.3.2 Method 2
The second method of converting a decimal number to a binary number entails repeatedly dividing
the decimal number by 2, keeping track of the remainder at each step. To convert the decimal
number x to binary:
Step 1. Divide x by 2 to obtain a quotient and remainder. The remainder will either be 0 or 1.
Step 2. If the quotient is zero, you are done: Proceed to Step 3. Otherwise, go back to Step 1,
assigning x to be the value of the most-recent quotient from Step 1.
Step 3. The sequence of remainders, read from the last to the first, forms the binary
representation of the number.
Let us convert the decimal number 71 to binary using this method:
2 into 71 = 35, remainder 1
2 into 35 = 17, remainder 1
2 into 17 = 8, remainder 1
2 into 8 = 4, remainder 0
2 into 4 = 2, remainder 0
2 into 2 = 1, remainder 0
2 into 1 = 0, remainder 1

Now, taking up all the remainders from the bottom, 71₁₀ = 1000111₂.
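Method 2 can be sketched in Python (an illustrative sketch):

```python
def decimal_to_binary_division(n: int) -> str:
    """Method 2: repeatedly divide by 2, then read the remainders last to first."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)        # quotient and remainder of division by 2
        remainders.append(str(r))
    return "".join(reversed(remainders))

print(decimal_to_binary_division(71))  # 1000111
```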


Reading/Activity: If you have access to the prescribed textbook's online resources, please
read online Chapter 19, which deals exclusively with Number Systems. At this stage read sections
19-1 to 19-3. If you have no access to this resource, read these guide notes and also, as much as
possible, search the internet for related information. There is a tremendous amount of relevant
information on number systems alone. Try to go through the various examples and
questions given, and gauge your retention level.

Self-Assessment
1 Explain as briefly and comprehensibly as possible why the binary number system is the preferred
number system in computing.
2 Using Method 1, convert 210₁₀ to binary.
3 Using Method 2, convert 210₁₀ to binary and compare your answer to that of question 2.
4 Convert 101101101 to decimal.

4.4 Binary Addition


Of all possible arithmetic manipulations on binary numbers, addition appears to be the most
straightforward. We will demonstrate this using an example:
let us add the two binary numbers 101101 and 11011.
The first step is to arrange the numbers one over the other in readiness for addition. It may be
desirable to pad leading zeros on the shorter number, although this may not be required:
101101
+ 011011
1001000
We add from right to left just as is the case with decimal addition. Adding a 1 and a 1 gives us
10, so we put down the 0 and carry the 1 to the next column. In the next column, a 0 and a 1
give us 1, but we must not forget the carry from the previous column; adding that 1 to our current
total of 1 gives us 10, so we put down the 0 and carry a 1. In the third column from the right, the
bits total 1, and adding the carry from the previous column again gives us 10, so once more we put
down the 0 and carry the 1. The fourth column has a total of 10, and adding the carry gives us
11, so we put down a 1 and carry a 1 to the fifth column. The fifth column gives us a total of 1,
and adding our carry we get 10: we put down a 0 and carry a 1. Moving on to the sixth column,
the bits total 1, and adding the carry gives us 10. Because there is no further column after that,
we put down the whole 10 as it is.


So we wind up with a total of 1001000. This number is 72₁₀. We may now verify this addition using
decimal: the top number 101101 = 45₁₀ and the bottom number 11011 = 27₁₀. Adding these
numbers in decimal gives us 72₁₀.
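The column-by-column procedure above can be sketched in Python (an illustrative sketch):

```python
def add_binary(a: str, b: str) -> str:
    """Add two binary strings column by column, right to left,
    carrying a 1 whenever a column sums to 2 or 3."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)   # pad leading zeros on the shorter number
    carry, result = 0, []
    for bit_a, bit_b in zip(reversed(a), reversed(b)):
        column = int(bit_a) + int(bit_b) + carry
        result.append(str(column % 2))      # the bit we put down
        carry = column // 2                 # the carry into the next column
    if carry:
        result.append("1")                  # a final carry becomes a new leading bit
    return "".join(reversed(result))

print(add_binary("101101", "11011"))  # 1001000, i.e. 45 + 27 = 72
```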
4.5 Binary Subtraction
We will use the same numbers, for convenience's sake, to work through an example of binary
subtraction. Binary subtraction is quite straightforward, as we shall see. The subtraction of the
two numbers follows.
4.5.1 Method 1: The Borrow Method
We use the borrow method when we find ourselves in situations where we have to subtract a 1
from a 0.
  101101   = 45₁₀
- 011011   = 27₁₀
  010010   = 18₁₀

Again, moving from right to left, the first column is simple because we are subtracting a binary digit
(bit) 1 from another binary digit 1, giving us the answer 0. Moving on to the second column, we
are to subtract a 1 bit from a 0 bit. This means that we have to borrow from the next column (in
this case from the third column). The 0 bit in column two becomes 10 and the 1 bit in column
three becomes 0. Going back to column two, 10 minus 1 gives a 1 bit, which we put down.
Moving on to the third column, we know already that our 1 bit is now zero since it was
borrowed by column two. So a 0 bit is subtracted from this 0 bit and the answer is 0, which
we put down. We move on to the fourth column, where a 1 bit is subtracted from a 1 bit,
yielding another 0 bit. Moving on to column five, we encounter that awkward scenario
again, where a 1 bit is subtracted from a 0 bit. Again we borrow the 1 bit from the sixth
column, and the 0 bit in the fifth column becomes 10. Now 10 minus 1 gives a 1 bit,
which we put down, as usual. Going on to the sixth column, the 1 bit has been borrowed by the
fifth column, so we have a 0 bit here, not a 1 bit. So 0 minus 0 gives us a 0, which we
may put down for completeness.

4.5.2 Method 2: 2's Complement


The 2's complement method makes binary subtraction quite straightforward. In the computer,
negative numbers are stored in 2's complement form. To negate a number, that is, to form its 2's
complement, simply toggle its bits and add 1. Toggling is a simple bit-wise reversal: every 1 bit
becomes a 0 bit and vice-versa. Having accomplished that, you then add the two numbers. We
shall continue with our two numbers from the previous examples:
011011 needs to be subtracted from 101101. We convert 011011 to 2's complement by toggling its
bits, giving 100100, and adding 1, which gives 100101. Now we may proceed with the subtraction
(note that from this point the subtraction proceeds by addition, since the number being subtracted
is now negative) as follows:
   101101   = 45₁₀
+  100101   = -27₁₀
(1)010010   = 18₁₀

In the result you may have observed an extra 1 bit in parentheses. It is a carry out of the most
significant bit and may be discarded. The answer therefore is 010010, which is 18₁₀.
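The toggle-and-add-1 procedure can be sketched in Python; the six-bit width mirrors the worked example and is an assumption of the sketch:

```python
def subtract_via_twos_complement(a: int, b: int, width: int = 6) -> int:
    """Compute a - b by adding the 2's complement of b,
    discarding any carry out of the chosen bit width."""
    mask = (1 << width) - 1            # e.g. 0b111111 for a width of 6 bits
    b_complement = (~b + 1) & mask     # toggle the bits, then add 1
    return (a + b_complement) & mask   # the mask drops the carry-out bit

print(bin(subtract_via_twos_complement(0b101101, 0b011011)))  # 0b10010, i.e. 18
```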


4.6 Binary Multiplication


Binary multiplication uses the same technique as decimal multiplication. In fact, binary multiplication
is much easier because each digit we multiply by is either zero or one. Consider the simple problem
of multiplying 110₂ by 10₂. We can use this problem to review some terminology and illustrate the
rules for binary multiplication.
1. First, we note that 110₂ is our multiplicand and 10₂ is our multiplier.
2. We begin by multiplying 110₂ by the rightmost digit of our multiplier, which is 0. Any number
times zero is zero, so we just write zeros below.
3. Now we multiply the multiplicand by the next digit of our multiplier, which is 1. To perform this
multiplication, we just copy the multiplicand and shift it one column to the left, as we do in
decimal multiplication.
4. Now we add our results together. The product of our multiplication is 1100₂.

The completed working looks like this:

   110
x   10
   000
  110
  1100
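The copy-and-shift procedure above amounts to shift-and-add multiplication, which can be sketched in Python:

```python
def multiply_binary(multiplicand: str, multiplier: str) -> str:
    """Shift-and-add multiplication: for each 1 bit of the multiplier,
    add a copy of the multiplicand shifted left into that bit's column."""
    total = 0
    for position, bit in enumerate(reversed(multiplier)):
        if bit == "1":
            total += int(multiplicand, 2) << position  # shift left by the column number
    return bin(total)[2:] or "0"

print(multiply_binary("110", "10"))  # 1100, i.e. 6 x 2 = 12
```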

4.7 Binary Division


We can calculate binary division problems using the same technique as long division in the decimal
system. It will be helpful to review some of the basic terms for division. Consider the division
problem below.
      5
 6 | 33
     30
      3

In this problem, 6 is the divisor, 33 is the dividend, 5 is the quotient, and 3 is the remainder. We
will use these same terms to describe how binary division is done.


Now let's look at a simple division problem in binary: 11₂ ÷ 10₂, or 3₁₀ ÷ 2₁₀. This time 10₂ is the divisor
and 11₂ is the dividend. The steps below show how to find the quotient, which is 1.1₂.
1. First, we need to find the smallest part of our dividend that is greater than or equal to our
divisor. Since our divisor has two digits, we start by checking the first two digits of the dividend.
2. 11 is greater than 10, so we write a 1 in the quotient, copy the divisor below the dividend, and
subtract using the borrow method, leaving a remainder of 1.
3. Since we have no more digits in our dividend, but we still have a remainder, our answer must
include a fraction. To finish our problem we need to mark the radix point and append a zero to
the dividend.
4. Now we bring down the extra zero and write it beside our remainder, making 10. Then we check
to see if this new number is greater than or equal to our divisor. Notice we ignore the radix point
in our comparison.
5. 10 equals the divisor 10, so we write a 1 in the quotient, copy the divisor below the dividend,
and subtract. This completes our division because we have no more digits in the dividend and no
remainder.

The completed long division looks like this:

       1.1
 10 | 11.0
      10
       10
       10
        0
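For whole-number operands, the quotient and remainder produced by long division can be cross-checked in Python with the built-in divmod:

```python
# Cross-check of the worked example: 11 (binary) divided by 10 (binary),
# i.e. 3 divided by 2, gives an integer quotient of 1 and a remainder of 1.
dividend, divisor = 0b11, 0b10
quotient, remainder = divmod(dividend, divisor)
print(bin(quotient), bin(remainder))  # 0b1 0b1
```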

Reading/Activity: If you have access to the prescribed textbook's online resources, please
read online Chapter 19, which deals exclusively with Number Systems. At this stage read sections
19-1 to 19-3. If you have no access to this resource, read these guide notes and also, as much as
possible, search the internet for related information. There is a tremendous amount of relevant
information on number systems alone. Try to go through the various examples and
questions given, and gauge your retention level.

Self-Assessment
1 Add 1100101 to 101011 showing your working as clearly as possible
2 Perform this Addition: 10010111 + 11100110
3 Perform this subtraction: 11111110 - 10010111, using both the borrow method and 2's complement
4 Perform the following subtraction, showing your working: 11100110 - 10010111, using both the
borrow method and 2's complement
5 What is 2's complement representation and why is it important?
6 Convert 100100111010 to 2's complement.


4.8 Digital Logic


A logic gate is an elementary building block of a digital circuit. Most logic gates have two inputs and
one output. At any given moment, every terminal is in one of the two binary conditions low (0) or
high (1), represented by different voltage levels. The logic state of a terminal can, and generally
does, change often, as the circuit processes data. In most logic gates, the low state is approximately
zero volts (0 V), while the high state is approximately five volts positive (+5 V).
There are seven basic logic gates: AND, OR, NOT, NAND, NOR, XOR, and XNOR. We will examine each in turn.
4.8.1 The AND Gate
The AND gate has two or more inputs which always result in one output. The output is as follows:
Output = 1 (high) when all inputs are high, and Output = 0 (low) when at least one of the inputs is low.
Below is a diagram of both the AND gate and its truth table.

4.8.2 The OR Gate


The OR gate also has two or more inputs which always result in one output. The output is as follows:
Output = 1 (high) when at least one of the inputs is high (1), and Output = 0 (low) when all inputs
are low.


4.8.3 The NOT Gate (Also Called The Inverter)


The NOT gate has only one input and only one output. The output is as follows: Output = 1 when the
input is 0 (low), and Output = 0 when the input is 1 (high). It inverts the input: what comes out is
the complement of the input.

4.8.4 The NAND Gate


The NAND gate has two or more inputs and one output. The output is as follows: Output = 1 (high)
when at least one of the inputs is low (0), and Output = 0 (low) when all of the inputs are high (1).


4.8.5 The NOR Gate


The NOR Gate has two or more inputs and one output. The output is as follows: Output = 1 (high)
when all the inputs are low (0) and Output = 0 (low) when at least one of the inputs is high (1).

4.8.6 The Exclusive-OR Gate


The Exclusive-OR gate always has two inputs and one output. The output is as follows:
Output = 1 (high) when the inputs differ, and Output = 0 (low) when the inputs are the same.

4.8.7 The Exclusive-NOR Gate


The Exclusive-NOR gate always has two inputs and one output. The output is as follows:
Output = 1 (high) when the inputs are the same, and Output = 0 (low) when the inputs differ.
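The gate behaviours above can be tabulated with a short Python sketch (an illustration, not part of the prescribed text); each function returns the output bit for a pair of input bits:

```python
# The two-input gates as Python functions; printing each gate's outputs for the
# input pairs 00, 01, 10, 11 reproduces its truth table.
gates = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),   # NOT of AND
    "NOR":  lambda a, b: 1 - (a | b),   # NOT of OR
    "XOR":  lambda a, b: a ^ b,
    "XNOR": lambda a, b: 1 - (a ^ b),   # NOT of XOR
}
for name, gate in gates.items():
    outputs = [gate(a, b) for a in (0, 1) for b in (0, 1)]
    print(f"{name:5} {outputs}")        # outputs for inputs 00, 01, 10, 11
```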


4.9 Boolean Algebra


Boolean algebra has similar rules to other algebras and these rules are used to manipulate the
expression at hand. Some of these basic rules are:
(In what follows, A' denotes NOT A, usually typeset with an overbar.)
One variable:
A.A = A
A + A = A
A.A' = 0
A + A' = 1
One variable and 0 or 1:
A.0 = 0
A.1 = A
A + 0 = A
A + 1 = 1
DeMorgan's Theorem:
(A.B)' = A' + B'
(A + B)' = A'.B'
Associative:
(A.B).C = A.(B.C)
(A +B) + C = A + (B + C)
Commutative:
A.B = B.A
A+B=B+A
Distributive:
A.(B + C) = A . B + A.C
A + (B.C) = (A + B).(A + C)
Note: The OR operator is represented by the + symbol and has the lowest precedence. The NOT
operator, represented by an overbar (written here as '), has the highest precedence. The AND
operator, represented by the . symbol, sits in between.
As an example, let us simplify a Boolean expression using some of the above rules. Thereafter we
can look at how we can build logic circuits from these Boolean expressions.
The expression that we need to simplify is F = A'B + A'B' + ABC + AB.
Using the above rules we can simplify the expression as follows:
F = A'B + A'B' + ABC + AB
= A'(B + B') + AB(C + 1)
= A'(1) + AB(1) [B + B' = 1 and C + 1 = 1]
= A' + AB
= (A' + A).(A' + B) [Distributive Law: X + Y.Z = (X + Y).(X + Z)]
= (1).(A' + B)
F = A' + B
You can solve any similar problem the very same way.
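Any proposed simplification can be double-checked by exhaustively comparing the truth tables of the original and simplified expressions. The sketch below (writing NOT x as 1 - x in Python) verifies one such identity; the specific expressions are illustrative:

```python
from itertools import product

def equivalent(f, g, nvars):
    """Check that two Boolean functions agree on every input combination."""
    return all(f(*bits) == g(*bits) for bits in product((0, 1), repeat=nvars))

# Example: verify that A'B + A'B' + AB simplifies to A' + B.
original   = lambda a, b: (1 - a) & b | (1 - a) & (1 - b) | a & b
simplified = lambda a, b: (1 - a) | b
print(equivalent(original, simplified, 2))  # True
```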


4.10 Building Logic Circuits From Boolean Expressions


Assuming that we have some Boolean functions representing some computer logical circuits that we
want to design, how do we go from Boolean function to computer circuit? Consider the following
functions:
(a) F = a + b + c
(b) F = a + b'
(c) F = b' + ab
(d) F = (a' + b).(a + b')

The two designs in (a) are the same but the lower design is more expensive because it contains more
logic gates and circuitry. The upper design is therefore more desirable because it is minimalist and
achieves the same goal as the lower one. We move on to (b) and (c). (b) is a simple logic OR circuit in
which one of the inputs is inverted, while (c) is an example of a Sum Of Products (SOP): the product
terms (ab and b') are summed (+).


Finally we move on to (d), which is an example of a Product Of Sums (POS), in which the sums
involving input a and input b are ANDed.


Reading/Activity: If you have access to the prescribed textbook's online resources, please
read online Chapter 20, which deals exclusively with Digital Logic. At this stage read sections 20-1 to
20-5. If you have no access to this resource, read these guide notes and also, as much as possible,
search the internet for related information.
You are encouraged to do a little research on such topics as Sum Of Products (SOP), Product Of Sums
(POS), Truth Tables and The Use of Truth Tables.

Self-Assessment
1 Divide 00011 into 10010 showing your working as clearly as possible
2 Perform this Multiplication: 1010 x 1100
3 Reduce the expression F = b + a'c + abc to its simplest form
4 Draw the resultant logic circuit.

4.11 Assembly Language, Assemblers And Compilers


A processor understands and executes machine instructions and these machine instructions are
simply binary numbers stored in the computer. If a programmer wished to program the machine
directly, he would do it in machine language, entering instructions and data as binary numbers.
Consider a simple statement in BASIC:
N=I+J+K
Suppose we wished to program this statement in machine language and to initialize I, J and K to 2, 3,
and 4 respectively. Suppose also that memory is reserved for the 4 variables starting at location 201
(Hex). The program will load the content of 201 [I] into the AC, ADD the contents of 202 [J] to the AC,
ADD the contents of 203 [K] to the AC, and then store the contents of the AC [which is the sum of I, J
and K] in 204. This apparently is a tedious and very error-prone process when it is being done by
hand by a programmer.
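The sequence just described can be mimicked with a tiny accumulator-machine simulation in Python; the opcode names and the 0x201 to 0x204 memory layout are illustrative assumptions for this sketch, not a real machine's instruction set:

```python
# Memory holds I, J, K and a slot for N at the hexadecimal addresses used above.
memory = {0x201: 2, 0x202: 3, 0x203: 4, 0x204: 0}
ac = 0  # the accumulator (AC)

# LOAD I, ADD J, ADD K, then STORE the running total into N.
program = [("LOAD", 0x201), ("ADD", 0x202), ("ADD", 0x203), ("STORE", 0x204)]
for opcode, address in program:
    if opcode == "LOAD":
        ac = memory[address]
    elif opcode == "ADD":
        ac += memory[address]
    elif opcode == "STORE":
        memory[address] = ac

print(memory[0x204])  # 9, i.e. N = I + J + K = 2 + 3 + 4
```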
A slight improvement is to write the program in hexadecimal rather than binary. The program could
contain a series of lines and each line could contain the address of the memory location and the
hexadecimal code of the binary value to be stored in that location. A program will be needed that
will accept this input, translate each line into a binary number and then store it in the specified
location.
This approach can be improved further by giving memory locations some pseudonyms or
mnemonics although absolute addressing techniques must still be employed in order to store each
word. This means that both program and data will need to be stored in one place in memory and the
challenge is that the programmer must know the address of that place beforehand. Worse, suppose
changes need to be made to the program in a certain place within the program, this means that
changes will have to be made to all subsequent words in the program. A much better system is to
use symbolic addresses. In using symbolic addresses, a symbol denoting that address is used instead
of an absolute numerical address. With this kind of refinement, we have an assembly language.
Programs written in assembly language are translated into machine language by an assembler. This


program, besides performing the symbolic translation, also allocates memory addresses to the
symbolic addresses. The development of assembly language was a major milestone in the evolution
of computer technology and was the first step to the high-level languages that are in use today.
Although few programmers use assembly language today, virtually all machines provide one.
Assembly language interacts with systems programs such as compilers and I/O routines.

Self-Assessment Mini Project


Assume the following expression: F = (((abc + ab + bc). a + b).c) + abc. Use a Truth Table to reduce it
to its most simplified SOP form which is F = bc. Show all working.
1 Draw the logic circuit of the original expression in F above.[F = (((abc + ab + bc). a + b).c)].
2 The Truth Table will give you an intermediate solution which you will have to simplify to F = bc
using the Boolean Algebraic rules introduced in section 4.9. Draw the logic circuit of this
intermediate function.
3 Finally draw the logic circuit of F = bc.


5. The Memory-Cache Hierarchy


Section Outcomes
At the end of this section, the learner should be able to:

Depict, represent and explain the general computer memory-cache hierarchical model
Relate to and understand cache architecture and the different ways in which cache is
implemented in relation to main memory
Demonstrate an understanding of cache memory organization
Demonstrate a good understanding of what Internal Memory is and identify the common
types of internal memory that are commonly used
Be conversant with external memory, the types thereof and the different types and levels of
redundancy that can be implemented on them
Obtain a broad understanding of I/O and comprehensively explain the three main
techniques used for I/O implementation and organization

Cache is a small high-speed memory, usually Static RAM (SRAM), that contains the most
recently accessed pieces of main memory. Why is this high-speed memory necessary or beneficial?
In today's systems, the time it takes to bring an instruction (or piece of data) into the processor is
very long when compared to the time to execute the instruction. For example, a typical access time
for DRAM is 60 ns, while a 100 MHz processor can execute most instructions in 1 clock, i.e. 10 ns.
Therefore a bottleneck forms at the input to the processor.
Cache memory helps by decreasing the time it takes to move information to and from the processor.
A typical access time for SRAM is 15 ns. Therefore cache memory allows small portions of main
memory to be accessed 3 to 4 times faster than DRAM (main memory). How can such a small piece
of high speed memory improve system performance?
The theory that explains this performance is called Locality of Reference. The concept is that at
any given time the processor will be accessing memory in a small or localized region of memory. The
cache loads this region, allowing the processor to access the memory region faster. How well does
this work?
In a typical application, the internal 16K-byte cache of a Pentium processor contains over 90% of
the addresses requested by the processor. This means that over 90% of the memory accesses occur
out of the high-speed cache. So now the question: why not replace main memory DRAM with SRAM?
The main reason is cost. SRAM is several times more expensive than DRAM. Also, SRAM consumes
more power and is less dense than DRAM. Now that the reason for cache has been established, let
us look at a simplified model of a cache system.
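The payoff of a high hit rate can be quantified with the standard hit-rate-weighted average (effective access time); the sketch below plugs in the figures quoted above (15 ns SRAM, 60 ns DRAM, a 90% hit rate), which are illustrative rather than measurements of any specific system:

```python
# Effective access time = hit_rate * cache_time + (1 - hit_rate) * memory_time.
hit_rate = 0.90
t_cache, t_dram = 15, 60   # access times in nanoseconds

effective = hit_rate * t_cache + (1 - hit_rate) * t_dram
print(f"{effective:.1f} ns")  # 19.5 ns, versus 60 ns with no cache at all
```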


5.1 Basic Model

[Figure 5.1 Basic Cache Model: the CPU, cache memory and main (DRAM) memory are all connected by the bus, which leads to the system interface.]


Figure 5.1 shows a simplified diagram of a system with cache. In this system, every time the CPU
performs a read or write, the cache may intercept the bus transaction, allowing the cache to
decrease the response time of the system. Before discussing this cache model, let us define some of
the common terms used when talking about cache:
Cache Hits: When the cache contains the information requested, the transaction is said to
be a cache hit.
Cache Miss: When the cache does not contain the information requested, the transaction is
said to be a cache miss.
Cache Consistency: Since the cache is a copy of a small piece of main memory, it is
important that the cache always reflects what is in main memory.
Some common terms used to describe the process of maintaining cache consistency are:


Snoop: When a cache is watching the address lines for a transaction, this is called a snoop.
This function allows the cache to see if any transactions are accessing memory it contains
within itself.
Snarf: When a cache takes the information from the data lines, the cache is said to have
snarfed the data. This function allows the cache to be updated and thereby maintaining
consistency.

Snoop and snarf are the mechanisms the cache uses to maintain consistency. Two other terms are
commonly used to describe the inconsistencies in the cache data, these terms are:
Dirty Data: When data is modified within cache but not modified in main memory, the data
in the cache is called dirty data.
Stale Data: When data is modified within main memory but not modified in cache, the data
in the cache is called stale data.
5.2 Cache Architecture
Caches have two characteristics, [1] a read architecture and, [2] a write policy. The read architecture
may be either Look Aside or Look Through. The write policy may be either Write-Back or
Write-Through. Both types of read architectures may have either type of write policy, depending
on the design. Write policies will be described in more detail in the next section. Let us examine the
read architecture now.


[Figure 5.2 Look-Aside Cache: the CPU, the cache (SRAM, cache controller and tag RAM) and main memory all sit on the bus, in parallel, above the system interface.]

Look-Aside Cache Read Architecture: Figure 5.2 shows a simple diagram of the Look-Aside
cache architecture. In this diagram, main memory is located opposite the system interface.
The discerning feature of this cache unit is that it sits in parallel with main memory. It is
important to notice that both the main memory and the cache see a bus cycle at the same
time. Hence the name look aside.
When the processor starts a read cycle, the cache checks to see if that address is a cache hit. If it
is a hit, i.e. the cache contains the memory location, the cache responds to the read cycle and
terminates the bus cycle. If it is a miss, i.e. the cache does not contain the memory location, main
memory responds to the processor and terminates the bus cycle; the cache will snarf the data, so
the next time the processor requests this data it will be a cache hit.
Look-Aside caches are less complex, which makes them less expensive. This architecture
also provides better response to a cache miss since both the DRAM and the cache see the
bus cycle at the same time. The drawback is that the processor cannot access cache while
another bus master is accessing main memory.


Look-Through Cache Read Architecture: Figure 5.3 shows a simple diagram of this cache
architecture. Again, main memory is located opposite the system interface. The discerning
feature of this cache unit is that it sits between the processor and main memory. It is
important to notice that the cache sees the processor's bus cycle before allowing it to pass on
to the system bus. When the processor starts a memory access, the cache checks to see if
that address is a cache hit. If it is a hit, the cache responds to the processor's request
without starting an access to main memory. If it is a miss, the cache passes the bus cycle
onto the system bus, and main memory then responds to the processor's request. The cache snarfs
the data so that next time the processor requests this data, it will be a cache hit.

This architecture allows the processor to run out of cache while another bus master is
accessing main memory, since the processor is isolated from the rest of the system.
However, this cache architecture is more complex because it must be able to control
accesses to the rest of the system. The increase in complexity increases the cost. Another
down side is that memory accesses on cache misses are slower because main memory is not
accessed until after the cache is checked. This is not an issue if the cache has a high hit rate
and there are other bus masters. Figure 5.3 shows a depiction of the Look-Through cache
Read Architecture.

5.3 Write Policy


A write policy determines how the cache deals with a write cycle. The two common write policies
are Write-Back and Write-Through. In Write-Back policy, the cache acts like a buffer. That is, when
the processor starts a write cycle the cache receives the data and terminates the cycle. The cache
then writes the data back to main memory when the system bus is available. This method provides
the greatest performance by allowing the processor to continue its tasks while main memory is
updated at a later time. However, controlling writes to main memory increases the cache's
complexity and cost.
The second method is the Write-Through policy. As the name implies, the processor writes through
the cache to main memory. The cache may update its contents, however the write cycle does not
end until the data is stored into main memory. This method is less complex and therefore less
expensive to implement. The performance with a Write-Through policy is slower since the processor
must wait for main memory to accept the data.


[Figure 5.3 Look-Through Cache Read Architecture: the cache (SRAM, cache controller and tag RAM) sits between the CPU and the system interface to main memory.]


5.4 Cache Components
The cache sub-system can be divided into three functional blocks: SRAM, Tag RAM, and the Cache
Controller. In actual designs, these blocks may be implemented by multiple chips or all may be
integrated into a single chip.

SRAM: Static Random Access Memory (SRAM) is the memory block which holds the data.
The size of the SRAM determines the size of the cache.

Tag RAM: Tag RAM (TRAM) is a small piece of SRAM that stores the addresses of the data
that is stored in the SRAM.

Cache Controller: The cache controller is the brains behind the cache. Its responsibilities
include: performing the snoops and snarfs, updating the SRAM and TRAM and implementing
the write policy. The cache controller is also responsible for determining if memory request
is cacheable and if a request is a cache hit or miss.


5.5 Cache Organization


In order to fully understand how caches can be organized, two terms need to be defined. These
terms are cache page and cache line. Let us start by defining a cache page. Main memory is divided
into equal pieces called cache pages. The size of a page is dependent on the size of the cache and
how the cache is organized. A cache page is broken into smaller pieces, each called a cache line. The
size of a cache line is determined by both the processor and the cache design. Figure 5.4 shows how
main memory can be broken into cache pages and how each cache page is divided into cache lines.
We will discuss cache organizations and how to determine the size of a cache page.

[Figure 5.4 Cache Page: main memory is divided into cache pages, and each cache page is divided into cache lines.]

5.6 Fully Associative Cache Organization


The first cache organization to be discussed is Fully-Associative cache. Figure 5.5 below shows a
diagram of a Fully Associative cache. This organizational scheme allows any line in main memory to
be stored at any location in the cache. Fully-Associative cache does not use cache pages, only lines.
Main memory and cache memory are both divided into lines of equal size. For example Figure 5.5
shows that Line 1 of main memory is stored in Line 0 of cache. However this is not the only
possibility, Line 1 could have been stored anywhere within the cache. Any cache line may store any
memory line, hence the name, Fully Associative.


[Figure 5.5 Fully Associative Cache: any line of main memory (Line 0 to Line m) may be stored in any line of the cache (Line 0 to Line m).]


A Fully Associative scheme provides the best performance because any memory location can be
stored at any cache location. The disadvantage is the complexity of implementing this scheme. The
complexity comes from having to determine if the requested data is present in cache. In order to
meet the timing requirements, the current address must be compared with all the addresses present
in the TRAM. This requires a very large number of comparators that increase the complexity and cost
of implementing large caches. Therefore, this type of cache is usually only used for small caches,
typically less than 4K.
5.7 Direct Map Cache Organization
Direct Mapped cache is also referred to as 1-Way Set Associative Cache. In this scheme, main
memory is divided into cache pages. The size of each page is equal to the size of the cache. Unlike
the fully associative cache, the direct map cache may only store a specific line of memory within the
same line of cache. For example, Line 0 of any page in memory must be stored in Line 0 of cache
memory. Therefore if Line 0 of Page 0 is stored within the cache and if Line 0 of page 1 is requested,
then Line 0 of Page 0 will be replaced with Line 0 of Page 1. This scheme directly maps a memory line
into an equivalent cache line, hence the name Direct Mapped cache. A Direct Mapped cache scheme
is the least complex of all three caching schemes. Direct Mapped cache only requires that the
current requested address be compared with only one cache address. Since this implementation is
less complex, it is far less expensive than the other caching schemes. The disadvantage is that Direct
Mapped cache is far less flexible making the performance much lower, especially when jumping
between cache pages.
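The direct mapping rule described above can be sketched as follows; the 16-byte line size and 256-line cache are assumed figures chosen for illustration:

```python
LINE_SIZE = 16    # bytes per cache line (assumed)
NUM_LINES = 256   # lines in the cache, i.e. cache size / line size (assumed)

def cache_line(address: int) -> int:
    """Direct mapping: every address maps to exactly one cache line,
    so the same line of every page lands in the same cache line."""
    return (address // LINE_SIZE) % NUM_LINES

# Line 0 of page 0 and line 0 of page 1 collide in the same cache line:
page_size = LINE_SIZE * NUM_LINES
print(cache_line(0), cache_line(page_size))  # 0 0
```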
5.8 Set Associative Cache Organization
A Set-Associative cache scheme is a combination of Fully-Associative and Direct Mapped caching
schemes. A set-associate scheme works by dividing the cache SRAM into equal sections (2 or 4
sections typically) called cache ways. The cache page size is equal to the size of the cache way. Each
cache way is treated like a small direct-mapped cache. In a 2-way scheme, two memory lines that
map to the same cache line may be stored at the same time. This helps to reduce the number of
times cache line data is written over. This
scheme is less complex than a Fully-Associative cache because the number of comparators is equal
to the number of cache ways. A 2-Way Set-Associative cache only requires two comparators, making
this scheme less expensive than a fully-associative scheme.


5.9 The Pentium Processor Cache


This section examines internal cache on the Pentium processor. The purpose of this section is to
describe the cache scheme that the Pentium processor uses and to provide an overview of how the
Pentium processor maintains cache consistency within a system. The above section broke cache
into neat little categories. However, in actual implementations, cache is often a series of
combinations of all the above mentioned categories. The concepts are the same, only the
boundaries are different. Pentium processor cache is implemented differently than the systems
shown in the previous examples. The first difference is the cache system is internal to the processor,
i.e. integrated into the part, therefore no external hardware is needed to take advantage of this
cache - helping to reduce the overall cost of the system. Another advantage is the speed of memory
request responses. For example, a 100MHz Pentium processor has an external bus speed of
66MHz. All external cache must operate at a maximum speed of 66MHz. However, an internal cache
operates at 100MHz. Not only does the internal cache respond faster, it also has a wider data
interface. The external interface is only 64 bits wide, while the internal interface between the cache
and the processor's pre-fetch buffer is 256 bits wide. Therefore, a huge increase in performance is
possible by integrating the cache into the CPU. A third difference is that the cache is divided into two
separate pieces to improve performance - a data cache and a code cache, each at 8K. This division
allows both code and data to readily cross page boundaries without having to overwrite one
another.

[Figure: CPU with internal L1 cache, external L2 cache, and main DRAM memory connected via the system interface]
Figure 5.6 Pentium Processor With L2 Cache


When developing a system with a Pentium processor, it is common to add an external cache.
External cache is the second cache in a Pentium processor system, therefore it is called a Level 2 (or
L2) cache. The internal processor cache is referred to as a Level 1 (or L1) cache. The names L1 and L2
do not depend on where the cache is physically located ( i.e., internal or external). Rather, it
depends on what is first accessed by the processor (i.e. L1 cache is accessed before L2 whenever a
memory request is generated). Figure 5.6 shows how L1 and L2 caches relate to each other in a
Pentium processor system.
5.10 Pentium Cache Organization
Both caches are 2-way set-associative in structure. The cache line size is 32 bytes, or 256 bits.
A cache line is filled by a burst of four reads on the processor's 64-bit data bus. Each cache way
contains 128 cache lines. The cache page size is 4K, or 128 lines.
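These figures determine how the cache splits an address when it is probed: 32-byte lines give 5 offset bits, 128 lines per way give 7 index bits, and (assuming the Pentium's 32-bit addresses) the remaining 20 bits form the tag. A quick back-of-the-envelope check:

```python
import math

line_size = 32            # bytes per line  -> byte-offset bits
lines_per_way = 128       # lines per way   -> set-index bits
address_bits = 32         # assumed 32-bit physical address

offset_bits = int(math.log2(line_size))             # 5
index_bits = int(math.log2(lines_per_way))          # 7
tag_bits = address_bits - offset_bits - index_bits  # 20

print(offset_bits, index_bits, tag_bits)   # 5 7 20
print(line_size * lines_per_way)           # 4096 bytes = 4K page per way
```

The last line confirms the 4K cache page size quoted in the text: 32 bytes per line times 128 lines per way.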
5.11 Operating Modes
Unlike the cache systems discussed in the Overview Of Cache, the write policy on the Pentium
processor allows the software to control how the cache will function. The bits that control the cache
are the CD (Cache Disable) and NW (Not Write-Through) bits. As the name suggests, the CD bit
allows the user to disable the Pentium processor's internal cache. When CD = 1, the cache is
disabled; when CD = 0, the cache is enabled. The NW bit allows the cache to be either write-through (NW = 0)
or write-back (NW = 1).
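On the Pentium these two bits live in control register CR0 (CD is bit 30, NW is bit 29). A simplified sketch of how the two bits can be decoded; real mode transitions involve additional steps (such as flushing) that are omitted here:

```python
# Decoding the Pentium's cache-control bits (CD and NW in CR0,
# following Intel's CR0 bit layout).
CD_BIT = 1 << 30   # Cache Disable
NW_BIT = 1 << 29   # Not Write-through

def cache_mode(cr0):
    """Return a simplified reading of the cache operating mode."""
    if cr0 & CD_BIT:
        return "cache disabled"
    return "write-back" if cr0 & NW_BIT else "write-through"

print(cache_mode(0))        # write-through (CD=0, NW=0)
print(cache_mode(NW_BIT))   # write-back    (CD=0, NW=1)
print(cache_mode(CD_BIT))   # cache disabled
```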
5.12 Cache Consistency
The Pentium processor maintains cache consistency with the MESI (Modified, Exclusive, Shared,
Invalid) protocol. MESI is used to allow the cache to decide if a memory entry should be updated or invalidated. With the Pentium
processor, two functions are performed to allow its internal cache to stay consistent, Snoop Cycles
and Cache Flushing. The Pentium processor snoops during memory transactions on the system bus.
That is, when another bus master performs a write, the Pentium processor snoops the address. If
the Pentium processor contains the data, the processor will schedule a write-back. Cache flushing is
the mechanism by which the Pentium processor clears its cache. A cache flush may result from
actions in either hardware or software. During a cache flush, the Pentium processor writes back all
modified (or dirty) data. It then invalidates its cache (i.e., makes all cache lines unavailable). After
the Pentium processor finishes its write-backs, it then generates a special bus cycle called the Flush
Acknowledge Cycle. This signal allows lower level caches, e.g. L2 caches, to flush their contents as
well.
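The snoop behaviour described above can be sketched as a tiny state machine. The transition rules here are heavily simplified for illustration (the real MESI protocol has more cases, e.g. snooped reads), and the addresses and states are hypothetical:

```python
# Minimal MESI-style snoop handling (simplified; real transitions have more cases).
MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

# Hypothetical resident lines: one dirty, one shared.
line_state = {0x1000: MODIFIED, 0x2000: SHARED}

def write_back(address):
    print(f"write-back of dirty line {address:#x}")

def snoop_external_write(address):
    """Another bus master writes this address: our copy can no longer be trusted."""
    state = line_state.get(address, INVALID)
    if state == MODIFIED:
        write_back(address)          # schedule a write-back of the dirty data first
    line_state[address] = INVALID    # then invalidate our copy

snoop_external_write(0x1000)   # dirty line: write back, then invalidate
snoop_external_write(0x2000)   # shared line: just invalidate
print(line_state)
```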


Reading/Activity: Please read pages 129-158 in your prescribed book.

Self-Assessment
1 What are the differences among sequential access, direct access and random access?
2 What is the general relationship among access time, memory cost and capacity?
3 How does the principle of locality relate to the use of multiple memory levels?
4 What are the differences among direct mapping, associative mapping and set associative
mapping?
5 For a direct mapped cache, a main memory address is viewed as consisting of three fields. List and
define the fields.

5.13 Internal Memory


The basic element of semiconductor memory is the memory cell. Although a variety of
electronic technologies is used, all semiconductor memory cells share certain properties:
They exhibit two stable (or semi-stable) states which can be used to represent binary
1 and 0.
They are capable of being written into (at least once) to set the state.
They are capable of being read to sense the state.
A memory cell has three functional terminals capable of carrying an electrical signal:
The Select Terminal, as the name suggests, selects a memory cell for a read or a
write operation.
The Control Terminal indicates a read or a write. For writing, the terminal provides
an electrical signal that sets the state of the cell to either 1 or 0. For reading, the
terminal is used for output of the cell's state.
5.14 DRAM And SRAM
The most common type of memory that we will look at is Random Access Memory [RAM].
This is a bit of a misnomer because all types of memory that we will currently look at as
internal memory are random access. One distinguishing characteristic of RAM is that it is
possible to both write new data into the memory and to read data out of the memory very
easily and rapidly. Both the reading and writing of data into memory are accomplished with
the use of electrical signals.
The other distinguishing characteristic of RAM is that it is volatile. A RAM must be supplied
with constant power supply. If the power is interrupted, then the data is lost. Thus RAM can
be used only as a temporary storage. The two traditional forms of RAM used in computers
are DRAM and SRAM.


RAM comes in two technologies: dynamic and static. A dynamic RAM
(DRAM) is made up of cells that store data as charge on capacitors. The presence or absence
of charge on a capacitor is interpreted as a binary 1 or 0 respectively. Because capacitors
have a tendency to discharge, dynamic RAMs require periodic charge refreshing, even with
power continuously supplied, to maintain storage of the data. Although a DRAM is used to
store a digital value, it is essentially an analog device.
In contrast, a Static RAM [SRAM] is a digital device that uses the same logic elements that
are used in the processor. In a SRAM, bits are stored using traditional flip-flop logic-gate
configurations. A static RAM will hold its data as long as power is supplied to it.
Both SRAM and DRAM are volatile but a SRAM cell is more complicated and larger than a
DRAM cell. This means that a DRAM is denser and less expensive than a corresponding
SRAM cell. On the other hand, a DRAM requires supporting refresh circuitry.
5.15 Types Of ROM
As the name suggests, Read Only Memory [ROM] contains a permanent pattern of data that
cannot be changed. A ROM is nonvolatile, meaning that no power source is necessary to
maintain the bit values that are in memory. While it is possible to read a ROM, it is not
possible to write new data into it. An important application of ROM is microprogramming.
Other potential applications include:
Library subroutines for frequently wanted functions
System programs
Function tables
For a modest-size requirement, the advantage of ROM is that the data or program is
permanently in main memory and need never be loaded from an external device. The ROM
is created like any other integrated circuit chip with the data actually wired into the chip as
part of the fabrication process. This presents two problems:
The data insertion step includes a relatively large fixed cost, whether one or
thousands of copies of a particular ROM are fabricated.
There is no room for error. If one bit is wrong, the whole batch of ROMs must be
thrown out.
When only a small number of ROMs with a particular memory content is required, a less
expensive alternative is the programmable ROM [PROM]. Like the ROM, the PROM is
nonvolatile and may be written into only once. For the PROM, the writing process is
performed electrically and may be performed by a supplier or customer at a time later than
the original chip fabrication. Special equipment however is required for the writing or
programming process. PROMs provide flexibility and convenience but the ROMs remain
attractive for high-volume production runs.
Another variation of ROM is the read-mostly memory, which is useful for applications in
which read operations are far more frequent than write operations but for which nonvolatile
storage is required. There are three common types of read-mostly memory: EPROM,
EEPROM and flash memory.


The Erasable PROM [EPROM] is read and written electrically, as with PROM.
However, before a write operation, all the storage cells must be erased to the same initial
state by exposure of the packaged chip to ultraviolet radiation. Erasure is performed by shining an
intense ultraviolet light through a window that is designed into the memory chip. This
erasure process can be performed repeatedly and each erasure can take up to 20 minutes
to complete. Thus, the EPROM can be altered multiple times and like the ROM and PROM, it
holds its data indefinitely. For comparable amounts of data, the EPROM is more expensive
than the PROM but has the advantage of multiple update capability.
A more attractive form of read-mostly memory is Electrically Erasable Programmable Read
Only Memory [EEPROM]. This is a read-mostly memory that can be written into at any time
without erasing prior contents; only the byte or bytes addressed are updated.
The EEPROM is therefore nonvolatile and flexibly updatable.
The last form of semiconductor memory is flash memory, so named because the microchip is
organized so that a section of memory cells can be erased in a single action, or "flash". Like
EEPROM, flash memory uses electrical erasing technology. An entire flash memory can be
erased in one to a few seconds, which is much faster than EEPROM, and it is also possible to
erase just a few blocks of memory rather than an entire chip. Flash memory, however, does
not provide byte-level erasure.

Reading/Activity: Please read pages 191-197 in your prescribed book.

Self-Assessment
1 What are the key properties of semiconductor memory?
2 What are the two senses in which the term Random Access Memory is used?
3 What is the difference between DRAM and SRAM in terms of application?
4 What is the difference between DRAM and SRAM in terms of characteristics such as speed, size
and cost?
5 List some applications of ROM.
6 Explain why one type of RAM is considered to be analog and the other digital.
7 Explain the differences between EPROM, EEPROM and flash memory.
8 How does SDRAM differ from ordinary DRAM?


5.16 External Memory


Magnetic disks are the foundation of all external memory on virtually every computer
system. Both removable and fixed disks are used in computer systems from personal
computers to mainframe computers and/or supercomputers.
Fundamentally, the disk is a metal or plastic platter coated with magnetisable material. The
data is recorded/written onto the platter and later read from the disk using a conducting
coil which is popularly known as the read/write head. On the disk, data is organized into
concentric rings and these rings are called tracks. The tracks are separated by gaps of space
but data is written on the tracks. The data density is higher on the inner tracks; this, however,
does not mean that the inner tracks hold more data. Data is denser on the inner tracks
because their circumference is much smaller than that of the outer tracks. The logical unit
of data transfer on the disk is the sector.
5.17 Disk Characteristics
A disk drive may hold more than one platter. If this is the case, each platter has its own
read/write head. The read/write head is either fixed or movable: in a fixed configuration
there is one head for every track on the disk, otherwise there is one movable head
per platter surface. If a platter is removable, it may, if desired, be taken out of the machine for
storage or transfer to another machine.
5.18 RAID Technology
RAID stands for Redundant Array of Independent Disks. This technology can raise system
performance, largely through the higher system availability it helps foster. RAID
refers to a family of techniques that uses multiple disks as a parallel array of data storage
devices, with redundancy built in to compensate for disk failure. Operating many disks in
parallel with simultaneous access lets the system harness the combined capacity and
bandwidth of several high-performance drives. With redundancy, data (or recovery
information) is spread across the disks, so that if one of the disks fails the system
continues operating as though nothing has happened: redundancy kicks in, and requests
for the malfunctioning disk are served from the remaining array without the user even
noticing. The redundant array thus greatly reduces the possibility of data loss.
There are 7 levels of RAID:
RAID 0: no redundancy techniques are used. In this configuration, data is
distributed over all disks in the array, divided into strips for storage. The
storage technique is similar to interleaved memory data storage. High
data transfer rates are supported by using block transfers whose size is a
multiple of a strip; low transfer rates can be supported by
using a block transfer size equal to just one strip.
RAID 1: In this implementation, all disks are mirrored. This means that duplicated
data is stored on a disk and its mirror. Data is read from either the disk or its mirror
but writing is always done to both the disk and the mirror. When a fault occurs on
the disk, the mirror is used for immediate recovery. One drawback of this
implementation is that it is expensive.
RAID 2: In this implementation, all disks are used for every access [read or write]
and are synchronized, so that every disk performs the same operation at the
same time. The data strips used in this implementation of RAID are small,


about a byte each. The error correction code is computed across all disks and stored
on (a) separate disk(s). This implementation uses fewer disks than RAID 1, but it is
still expensive.
RAID 3: This implementation is much like RAID 2 in its characteristics but only a
single redundant disk is used. This disk is called the parity drive. A parity bit is
computed for the full set of individual bits in the same position on all disks. If a drive
fails, the parity information on the redundant disk can be used to unscramble the
data from the failed disk on the fly.
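The parity used in RAID 2 and RAID 3 is a bitwise XOR across the data disks, which is what makes on-the-fly reconstruction possible: XOR-ing the surviving disks with the parity strip regenerates the lost data. A minimal byte-level sketch (the disk contents are illustrative values):

```python
from functools import reduce

# Byte strips on three data disks (illustrative values).
disks = [b"\x0f\x10", b"\xf0\x01", b"\x33\x44"]

# Parity disk: XOR of the corresponding bytes on every data disk.
parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*disks))

# Disk 1 fails: rebuild it from the surviving disks plus the parity strip.
survivors = [disks[0], disks[2], parity]
rebuilt = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))

print(rebuilt == disks[1])   # True - the lost data is recovered
```

This works because XOR is its own inverse: parity = d0 ^ d1 ^ d2, so d1 = d0 ^ d2 ^ parity.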

5.19 Optical Disks


The advent of CDs in the early 1980s revolutionized the audio and computer industries. CDs
operate at constant linear velocity: essentially one long track spirals across the disk, and
this track passes under the disk's head at a constant rate. This requires the disk to change
its rotational speed depending on which part of the disk is being read. To write the disk, a
laser is used to burn pits into the track, and this is a write-once operation. During a read, a
low-power laser illuminates the track and its pits. Pits reflect light differently from the areas
without pits, and this allows the CD to store 1s and 0s.
5.20 Magnetic Tape
This represents the first kind of magnetic memory and is still very widely used in many
installations. Magnetic tape is very cheap; the price you pay is its low speed and the fact
that access is sequential rather than random. Data is organized as records, with physical
gaps separating the records. One word is stored across the width of
the tape and read using multiple read/write heads.
5.21 Write Once, Read Many Disks
These allow the user to produce CD-ROMs in limited quantities. A specially prepared disk is
written to using a medium-power laser and can then be read many times, just like a normal
CD-ROM. WORM disks permit archival storage of information and the distribution of large
amounts of information by a user.

Reading/Activity: Please read pages 203-230 in your prescribed book.

Self-Assessment
1 Define the terms track, cylinder and sector.
2 What common characteristics are shared by all RAID levels?
3 What is the typical disk sector size?
4 How is data read from a magnetic disk?
5 Briefly define the seven RAID levels.


5.22 Input / Output


The computer system's I/O architecture is its interface to the outside world. The components that
make up this interface include buses, keyboards, CRT monitors, flat-panel display units,
scanners, modems, mouse devices, pens, printers, etc. The buses form the backbone of the I/O
architecture because every I/O module interfaces to the computer's bus system; in a
broad sense, the bus system stands at the center between the outside world and the whole I/O
modular interface.

There are three principal I/O techniques:


Programmed I/O
Interrupt-Driven I/O, and
Direct Memory Access [DMA]
5.23 Buses
The collection of paths that connect the system modules together form the interconnection
structure and the integral part of this system is the bus architecture. Buses provide data
pathways between components of the computer system. The buses are essentially shared
transmission media connecting 2 or more devices together. Buses are the main component
of the system architecture that facilitates the broadcast of signals from one component to
many other components in one data signal output burst [one-to-many broadcasting]. The
bus architecture ensures that only one device places information onto the bus at any one
time.
Typical buses consist of 50 to 100 lines. These lines carry various types of relevant
information from component to component according to the task at hand. Address
information is conveyed by the address bus specifying the source and/or destination in the
data transfer. Data is conveyed by the data bus. The bus width is key in determining overall
performance. The extent of the width is delimited by the number of lines that constitute the
bus.
Aggregate demand for access to the bus from all devices connected to the bus may limit
performance of the bus because there will be delays since the bus will keep listening to
device interrupts which it may have to accommodate [if there are free lines] or to turn
down by returning a busy signal if all its lines are busy. In order to obviate the problem of
delays and bus swamping, a multiple bus system can be implemented. The computer
implements a hierarchical multiple bus structure in which the high-speed buses which
enforce limited access by I/O modules are located closer to the processor and lower-speed
buses which tolerate general access are located further away from the processor. All these
buses are then connected to the external bus, which is normally referred to as the PC bus.
The first open system bus architecture that was implemented for PCs [small computers
based on the architecture around which the first IBM machines were built] was the
Industry Standard Architecture [ISA], and this architecture came with 8-bit and 16-bit
ISA buses. Other bus architectural implementations included Micro Channel Architecture
and PCI. Figure 5.7 shows a depiction of the ISA Bus. Figure 5.8 shows a depiction of
Hierarchical Bus Configurations.


Figure 5.7 ISA Bus


5.24 Programmed Input/Output
In this technique, the processor executes an I/O instruction by issuing a command to the
appropriate I/O module. The I/O module then responds by performing the requested action
and setting the appropriate bits in the I/O status register. The module takes no further action to alert
the processor. It does not interrupt the processor. The processor now has the task of
periodically checking the status of the I/O module until it determines that the operation is
complete.
Memory-mapped I/O: This technique implements a single address space for
both memory and I/O devices, with I/O modules treated as memory addresses.
The same machine instructions are used to access both memory and I/O devices.
The advantage is that this allows for more efficient programming; the obvious
disadvantage is that it consumes some of the much-needed address space
available to the programmer.
Isolated I/O: Separates the address spaces for memory and I/O devices. This
technique uses a small number of I/O instructions and is almost as commonly used
as Memory-Mapped I/O.
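The busy-wait loop at the heart of programmed I/O can be sketched as follows. The IOModule class here is hypothetical; real status-register layouts and completion timing are device-specific:

```python
import time

class IOModule:
    """Hypothetical I/O module with a status register the CPU must poll."""
    def __init__(self):
        self.busy = True
        self.data = None
        self._done_at = time.monotonic() + 0.01   # the device finishes "later"

    def status(self):
        # Reading the status register; the device flips to ready on its own.
        if time.monotonic() >= self._done_at:
            self.busy, self.data = False, 0x42
        return self.busy

module = IOModule()
polls = 0

# Programmed I/O: the processor does nothing useful while it waits.
while module.status():
    polls += 1          # each iteration is a wasted status check

print(f"polled {polls} times before data arrived: {module.data:#x}")
```

Every iteration of that loop is processor time spent checking rather than computing, which is exactly the inefficiency that interrupt-driven I/O (next section) removes.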
5.25 Interrupt-Driven I/O
This technique overcomes the need for the processor to have to wait for long periods of
time for I/O modules. The process does not have to expend loads of time checking to see if
the I/O module has finally completed a task. In this technique:

From the I/O module's point of view: The module receives a READ command from
the processor, and the I/O module reads the data from the desired peripheral into
its data register. After this, the I/O module interrupts the processor to notify the
processor that it has completed the requested task. The I/O
module then waits for the processor to request the data, at which point it
places the data on the data bus.


Figure 5.8 Hierarchical Bus Configurations


From the processor's point of view: The processor issues a READ command and
continues immediately with other useful work. The processor then completes
the current instruction cycle and, at the end of it, checks for interrupts. If it finds the
relevant interrupt pending, it saves the current context. The
processor then reads the data from the I/O module and writes it to memory, after
which it restores the saved context and resumes execution.
Design Issues: How does the processor determine which device issued the
interrupt? How are multiple interrupts dealt with? There are several ways of
identifying the interrupting device:
o Software Poll - The processor polls each I/O module with a separate
command line to test if it is the interrupting I/O module. The processor does
this by reading the status register of each I/O module. This technique is time
consuming.
o Daisy Chain (Hardware Poll) - This technique uses a common interrupt
request line for all I/O modules. The processor sends an interrupt
acknowledgement. The requesting I/O module places a word of data on the
data lines; this data word is called a vector, and its job is
to uniquely identify the I/O module. This kind of interrupt is called a vectored
interrupt.
o Bus Arbitration - The I/O module first gains control of the bus and then sends
an interrupt request to the processor. The processor acknowledges the
interrupt request and the I/O module places its vector on the data lines.
Multiple Interrupts: The techniques above not only identify requesting I/O modules
but also provide methods of assigning priority. Where there are multiple
interrupting lines, the processor identifies the line with the highest priority and picks
that line.
o Multiple Lines - Each interrupt line is assigned a priority, and the processor
services the highest-priority line that is active.
o Software Polling - Polling order determines priority
o Daisy Chain - Daisy-chain order of the modules determines priority
o Bus Arbitration - Arbitration scheme determines priority

5.26 Direct Memory Access


The drawback of the techniques we have looked at so far, Programmed and
Interrupt-Driven I/O, is that the I/O transfer rate is limited to the speed at which the
processor can test and service devices, and these processes tie up the processor while
it manages I/O transfers.
DMA addresses this by implementing a DMA module which resides on the system bus. This
module mimics the relevant functions of the processor and only uses the system bus when the
processor does not need it. The DMA module has the capability to force the processor to
suspend an operation, in a technique called Cycle Stealing.

DMA Operation: The processor issues a READ or WRITE command to the DMA
module, communicating the address of the I/O device on the data lines. The starting
[first] memory address is stored in the DMA module's address register. The number of words to be transferred using the data


lines is stored in the data count register. After this, the processor continues with other
work. The DMA module transfers the entire block of data one word at a time, directly to
or from memory without going through the processor, and when the transfer is
complete, the DMA module sends an interrupt to the processor. When the DMA module
needs the bus, the processor is suspended for one bus cycle while the DMA module
transfers one word, after which control returns to the processor. Since this is not an
interrupt, the processor does not have to save context. The processor executes more
slowly, but this is still far more efficient than either programmed or interrupt-driven I/O.
The DMA architecture can be implemented using a single bus configuration and this single
bus is the System bus. Another variation is the use of a special I/O bus which is the second
bus in addition to the main System bus:

Single Bus, Detached DMA Module: each transfer uses the bus twice (I/O to DMA,
then DMA to memory), and the processor is suspended twice.

Single Bus, Integrated DMA Module: more than one I/O device
may be supported, each transfer uses the bus once (DMA to memory), and the
processor is suspended just once.

Separate I/O Bus: the I/O bus supports all DMA-enabled devices, each transfer
uses the bus once (DMA to memory), and the processor is suspended once.


Reading/Activity: Please read pages 235-272 in your prescribed book.

Self-Assessment
1 List three broad classifications of external or peripheral devices.
2 What is the International Reference Alphabet?
3 What are the major functions of an I/O module?
4 List and briefly define three techniques for performing I/O.
5 Explain the difference between memory-mapped and isolated I/O.
6 When a device interrupt occurs, how does the processor determine which device issued the
interrupt?
7 When a DMA module takes control of a bus, and while it retains control of the bus, what does the
processor do?


6. Operating Systems Support


Section Outcomes
At the end of this section, the learner should be able to:

Explain the meaning of pipelining and relate to how the concept of pipelining takes
advantage of the multi-stage nature of the basic instruction cycle to maximize resource
utilization
Understand the constraints that undermine the effectiveness of pipelining and how they are
treated to avoid performance drawbacks
Understand pipeline performance and the limitations thereof
Understand, explain, compare and contrast the characteristics, advantages and limitations
of CISC and RISC architectures
Understand Instruction-Level Parallelism and Superscalar Processors

An instruction cycle has a number of stages and it is precisely this fact that makes it possible
for the pipelining strategy to be profitable in the processing of instructions. In a factory,
products at different stages of production can be worked on simultaneously by laying out
the production process as an assembly line. The assembly-line philosophy can be used to
explain effectively how pipelining works.
As a simple example, consider that an instruction has only two stages: the fetch instruction
and the execute instruction. There are times during the execute instruction when main
memory is not being accessed. This time could be used for fetching the next instruction in
parallel with the execution of the current one. The pipeline has two independent stages.
The first stage fetches an instruction and buffers it. When the second stage is free, the first
stage passes it the buffered instruction. While the second stage executes the instruction,
the first stage takes advantage of the fact that memory is not being used and fetches an
instruction and buffers it. This is called instruction prefetch or fetch overlap. In general,
pipelining requires registers to store data between stages. Figure 6.1 depicts a two-stage instruction pipeline.

Figure 6.1 Two-Stage Instruction Pipeline


At this stage it becomes necessary for us to decompose the instruction cycle as follows:
Fetch Instruction (FI): Read the next expected instruction into the buffer
Decode Instruction (DI): Determine the opcode and the operand specifiers
Calculate Operands (CO): Calculate the address of each source operand. This may
involve displacement, register indirect, indirect or other forms of address
calculations
Fetch Operands (FO): Fetch each operand from memory. Operands already in
register need not be fetched.
Execute Instruction (EI): Perform the indicated instruction and store the result, if
any, in the specified destination operand location.
Write Operand (WO): Store the result in memory.
If we assume that each of the above stages takes the same amount of time, it can be
shown that a six-stage pipeline reduces the execution time for nine instructions from
54 time units to 14 time units.

Figure 6.2: Timing Diagram A For Six-Stage Instruction Pipeline Operation Involving 9
Instructions.


The diagram rests on the assumptions that all stages can be performed in parallel and
that there are no memory conflicts. In this environment, the processor makes use of
pipelining to speed up execution by breaking the instruction cycle
into a number of separate stages that operate in sequence. However, the occurrence of
branches and dependencies between instructions complicates the design and
implementation of pipelining.
6.1 Pipeline Performance And Limitations
A good design goal for any system is to have all of its components performing useful work
at all times, so that high efficiency is achieved. In this section we develop some simple
measures of pipeline performance and relative speedup. The cycle time T of a pipeline is the
time required to advance a set of instructions one stage through the pipeline. The cycle
time T can be determined as:
T = max[Ti] + d = Tm + d,    1 ≤ i ≤ k

Where:
Ti = The time delay of the circuitry in the ith stage of the pipeline
Tm = The maximum stage delay; that is, the largest Ti over all k stages of the
pipeline
k = The number of stages in the instruction pipeline
d = The time delay of a latch; that is, the delay experienced in advancing data and
signals from one stage to the next
In general, the time delay d is equivalent to a clock pulse and Tm >> d. Now suppose that n
instructions are processed, with no branches. Let Tk,n be the total time required for a
pipeline with k stages to execute n instructions. Then:
Tk,n = [k + (n - 1)]T
A total of k cycles is required to execute the first instruction and the remaining n-1
instructions require n-1 cycles.
Now consider a processor with equivalent functions but no pipeline, and assume that its
instruction cycle time is kT. The speedup factor for the instruction pipeline compared to
execution without the pipeline is defined as:
Sk = T1,n / Tk,n = nkT / ([k + (n-1)]T) = nk / [k + (n-1)]
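As a quick check of the formula, the following sketch evaluates Sk for a few values; note how the speedup approaches the stage count k as n grows large:

```python
def speedup(k, n):
    """Speedup Sk = nk / [k + (n - 1)] of a k-stage pipeline over
    non-pipelined execution of n instructions (no branches assumed)."""
    return (n * k) / (k + (n - 1))

print(round(speedup(6, 9), 2))      # 3.86 for the nine-instruction example
print(round(speedup(6, 100000), 2)) # approaches the stage count, 6
```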

6.2 Pipeline Limitations


Several factors exist that limit the pipeline performance. If the six stages are not of equal
duration, there will be some waiting involved at various stages of the pipeline. Another
condition which can cause delays or at least some behavioral unpredictability is the
conditional branch or an interrupt. Memory conflicts also occur from time to time. The
system must therefore contain logic to deal with these conflicts in order to maintain system
efficiency and predictability. The stalling of a pipeline caused by the above reasons is called
a pipeline hazard. A pipeline hazard occurs when the pipeline or some portion of the
pipeline must stall because conditions do not permit continued execution. A stall can also be
referred to as a pipeline bubble. There are three types of hazards: resource, data and
control.
Resource Hazard: This type of hazard occurs when two or more instructions that are
already in the pipeline require the same resource. The result is that the affected
instructions must be executed serially (one after the other) rather than in parallel
for a portion of the pipeline. A resource hazard is sometimes referred to as a
structural hazard.
Data Hazard: A data hazard occurs when there is a conflict in the access of an
operand location. Consider two instructions of a program that are to be executed in
sequence and both access a particular memory or register operand. If the two
instructions are executed in strict sequence, no problem occurs. However if the
instructions are executed in a pipeline, it is possible that the operand value may be
updated in such a way that it will produce a different result than would occur if the
instructions were being executed sequentially. In other words, the program
produces an incorrect result because of the use of a pipeline.
Control Hazard: A control Hazard, also known as a branch hazard, occurs when the
pipeline makes the wrong decision on a branch prediction and therefore brings
instructions into the pipeline that must subsequently be discarded.
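The data-hazard case can be illustrated with a toy dependency check. The encoding of instructions as (destination, sources) tuples below is purely illustrative, not any real machine format:

```python
def raw_hazards(instructions):
    """Report read-after-write (RAW) hazards: instruction j reads a
    register that an earlier instruction i writes. Each instruction
    is a (dest, sources) tuple of register names."""
    hazards = []
    for i, (dest, _) in enumerate(instructions):
        for j in range(i + 1, len(instructions)):
            if dest in instructions[j][1]:
                hazards.append((i, j, dest))
                break  # only the first dependent reader matters here
    return hazards

# ADD r1 <- r2 + r3 followed immediately by SUB r4 <- r1 - r5:
program = [("r1", ("r2", "r3")), ("r4", ("r1", "r5"))]
print(raw_hazards(program))  # [(0, 1, 'r1')] -- SUB needs r1 too early
```

In a pipeline, the SUB may try to read r1 before the ADD has written it back; real processors resolve this with stalls or operand forwarding.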

Think Point: It is important to note that a pipeline executes an instruction in stages, and
these stages normally reflect the steps needed to complete the execution of a single
instruction. A pipeline moves multiple instructions through these stages simultaneously,
each instruction occupying a different stage or phase of execution, thereby enhancing
system performance.

Reading/Activity: Please read pages 462-479 in your prescribed book to consolidate what
you already know up to this stage about Pipelining.
6.3 Reduced Instruction Set Computers [RISCs]


These are architectures that use a small, simplified instruction set, as opposed to the
complex instruction sets of Complex Instruction Set Computers [CISCs], which power machines
such as desktops, servers and supercomputers. The RISC architecture is normally employed in
small devices and circuits such as those that run smartphones and tablet computers. RISC is
one of the few true innovations in computer organization and architecture in the last 50 or
so years of high-end computing. The key elements common to most designs are:
They implement a limited and simple instruction set
The organization involves a large number of registers and the use of compiler
technology to optimize register usage
Great emphasis is on optimizing the instruction pipeline
6.4 Instruction Execution Characteristics
In modern computing there has long existed what is known as the semantic
gap: the difference between the operations provided in high-level languages and
those provided in computer architecture. The semantic gap showed itself in symptoms
of execution inefficiency, excessive machine program size and compiler complexity.
New CISC designs introduced features to try to close this gap, including large
instruction sets, dozens of addressing modes and various high-level language statements
implemented at the hardware level. These designs were meant to make compiler writing easier
and to provide support for ever more complex and sophisticated high-level
languages. CISC architectures made it possible to implement complex sequences of
operations in microcode, an innovation intended to improve instruction
execution efficiency.
To understand the reasoning of RISC advocates, we look at study results on three main
aspects of computation:
Operations Performed: Representing the functions to be performed by the CPU and
its interaction with memory
Operands Used: Sheds light on the types of operands and their frequency of use,
which determine/influence memory organization and addressing modes.
Execution Sequencing: Sheds light on the control and organization of the pipeline.
Study results are based on dynamic measurements [during program execution] to enable a
true reflection of the effects on performance:
On Operations
o Simple counting of statement frequency indicates that assignment [data
movement] predominates followed by iteration and/or selection
o Weighted studies show that call/return actually accounts for the most work
o The study also looked at the dynamic frequency of classes of variables.
Results showed a preponderance of reference to highly localized scalars:
Majority of references are to simple scalars
Over 80% of the scalars were local variables
References to arrays and other data structures require a previous
reference to their index or pointer, which is usually a local scalar

On Operands:
o Another study showed that each instruction [DEC-10, in this case] references
0.5 operands in memory and 1.4 in registers
o Implications:
Need for fast operand accessing
Need for optimized mechanisms for storing and accessing local scalar
variables

Execution Sequencing:
o Subroutine calls are the most time-consuming operations in High-Level
Languages.
o There is an identified need to minimize their impact by:
Streamlining parameter passing
Efficient access to local variables
Support nested subroutine invocation
o Statistics:
98% of dynamically called procedures passed fewer than 6
parameters
92% use less than 6 local scalar variables
Rare to have a long sequence of subroutine calls followed by returns
[e.g. a recursive sorting algorithm]
Typically, the depth of nesting was very low.

Implications: The results above suggest that reducing the semantic gap through the
use of complex architectures may not be the most efficient use of system hardware.
Machine design should instead be optimized around the most time-consuming tasks of
typical high-level language programs. The findings make apparent the need for large
numbers of registers to reduce memory references by keeping variables close to the
CPU, and they argue for streamlining instruction sets rather than making them more
complex.

6.5 Large Register Files


The question that is relevant to ask at this juncture is: how can we make programs use
registers more often? We ask this because RISC advocates have already made the case for
using a large number of registers. The question can be answered from two fronts: [1] the
software standpoint and [2] the hardware standpoint:
Software: One way of achieving maximum register utilization by programs is to
optimize compilers, building into compiler design the ability to allocate registers
to those variables that will be used most in a given time period. This technique is
complex in that, to be successful, it requires sophisticated program-analysis
algorithms.
Hardware: Design must be such that more registers are made available so that they
will be used more often by ordinary compilers.
Naively adding registers will not effectively reduce the need to access memory. Since most
operand references are to local scalars, it is better to store them in registers, with perhaps a
few global variables. The problem with this assumption is that during program execution the
definition of local changes with each procedure or method call and return. It is therefore
better to use multiple small sets of registers, each assigned to a different procedure. Global
variables could just use memory, but this could be inefficient for frequently used globals. To
deal with this setback, it will be expedient to incorporate a set of global registers in the CPU.
In this scenario, the registers available to a procedure will be split: some will be the global
registers and the rest will be in the current window.
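The window idea can be sketched as a circular set of physical registers in which consecutive windows overlap, so a caller's output registers become the callee's input registers. The sizes used below are illustrative only and are not drawn from any particular RISC design:

```python
class RegisterWindows:
    """Toy model of overlapping register windows on a circular
    register file. A call advances the window so that the last
    `overlap` registers of the caller become the first `overlap`
    registers of the callee (used for parameter passing)."""

    def __init__(self, total=64, window=16, overlap=4):
        self.total, self.window, self.overlap = total, window, overlap
        self.base = 0  # physical index where the current window starts

    def call(self):
        self.base = (self.base + self.window - self.overlap) % self.total

    def ret(self):
        self.base = (self.base - self.window + self.overlap) % self.total

    def current(self):
        return [(self.base + i) % self.total for i in range(self.window)]

w = RegisterWindows()
caller = w.current()
w.call()
callee = w.current()
print(caller[-4:] == callee[:4])  # True: the 4-register overlap is shared
```

Because the buffer is circular, deep call chains eventually wrap around; a real design would spill the oldest window to memory at that point.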
6.6 A Look At CISC
There are arguments that Complex Instruction Set Computers [CISCs] are the more
effective and efficient option for getting the best out of a computer system.
Advocates of this position argue that CISC offers richer instruction sets,
accommodating both a larger number of instructions and more complex instructions to
handle complex algorithms such as those involved in scientific and actuarial work.
Because CISC incorporates high-level language constructs, it simplifies compilers and
improves system performance. CISC computing has the effect of minimizing code size,
which in the end contributes to a reduced instruction execution count. But because the
instruction execution count is low, pipelining as a processing technique is less
effective with CISC, since pipelining produces more efficient results as the number of
instructions increases.
Under CISC, programs are smaller meaning that they will execute faster due to the fact that
smaller programs have fewer instructions requiring less instruction fetching which translates
to the advantage that they save memory expenditure in the process. In paged environments
smaller programs occupy fewer pages resulting in fewer page faults.
Though CISC programs may contain fewer instructions, each instruction uses more bits, so
total memory used may not necessarily be smaller. This is because opcodes require more
bits, and operands also require more bits, since they are usually memory addresses [in the
instruction structure] as opposed to the register identifiers usual in RISC.
The CISC architecture incorporates a more complex Control Unit to accommodate seldom-used
complex operations, and because of this overhead the more often used simple
operations take longer. The speedup for complex instructions may be mostly due to their
implementation as simpler instructions in microcode, which is similar to the speed of
simpler instructions in RISC [except that the CISC designer must decide a priori which
instructions to speed up this way].
6.7 Characteristics Of RISC Architectures
Generally, the RISC architecture executes one instruction per machine cycle. A
machine cycle is defined as the time it takes to fetch two operands from registers, perform
an ALU operation and store the result in a register. RISC machine operations should be no
more complicated than, and should execute about as fast as, microinstructions on a CISC machine.
There is no microcoding that is needed, just simple instructions which will execute faster
than their CISC equivalents due to there being no need to access microprogram control
store. Virtually all machine operations in RISC are register-to-register operations. Only
simple LOAD and STORE operations access memory and this simplifies instruction set and
Control Unit design. This kind of design therefore encourages the optimization of register
use to be built into the architecture.
The RISC architecture uses simple addressing modes. Almost all instructions use simple
register addressing although a few other addressing modes such as displacement and PC
Relative may be provided. More complex addressing is implemented in software from the
simpler ones and this kind of design further simplifies the instruction set as well as the
Control Unit.
RISC also uses a few instruction formats which has the effect of building simplicity into the
Control Unit. Instruction length is fixed and aligned on word boundaries. This optimizes
instruction fetching since single instructions do not cross page boundaries. RISC enables
simultaneous opcode decoding and register operand access since field locations [especially
the opcode] are fixed.
The benefits of RISC design are that compilers are effectively optimized and the Control Unit
is in an overall sense simplified and a simpler CU can execute instructions much faster than
a comparable CISC unit. Instruction pipelining can be applied more effectively with a
reduced instruction set.
6.8 RISC Pipelining
The simplified structure of RISC instructions allows us to reconsider pipelining. Given that
most instructions are register-to-register, an instruction has two phases: [1] Fetch
Instruction (FI) and [2] Execute Instruction (EI). The EI phase is an ALU operation with
register input and output. For LOAD and STORE operations, three stages are needed: [1] FI,
[2] EI, and [3] M(emory). Because the EI phase usually involves an ALU operation, it may be
longer than the other phases, in which case we can divide it into two sub-phases: [1] EI1,
register file read, and [2] EI2, ALU operation and register write.
6.9 Optimization Of The Pipeline
Delayed Branch: We have seen that data and branch dependencies reduce the
overall execution rate in the pipeline. Delayed branch makes use of a branch that
does not take effect until after the execution of the following instruction. Note that
the branch takes effect during the EI phase of this following instruction so the
instruction location immediately following the branch is called the delay slot. This is
because the instruction-fetching order is not affected by the branch until the
instruction after the delay slot. Rather than wasting an instruction with a NOOP, it
may be possible to move the instruction preceding the branch to the delay slot while
still retaining program semantics.
Conditional Branches: If the instruction immediately preceding the branch cannot
alter the branch condition, this optimization can be used/applied otherwise a NOOP
is still required.
Delayed Load: On load instructions, the register to be loaded is locked by the
processor. The processor continues execution of the instruction stream until it
reaches an instruction needing a locked register. It then idles until the load is
complete. If LOAD takes a specific number of clock cycles, it may be possible to
arrange the instructions to avoid the idle.
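A crude sketch of the delay-slot idea: if the instruction just before a branch cannot alter the branch condition, a scheduler may move it into the delay slot; otherwise a NOOP is inserted. The (op, registers) encoding and the single-register branch condition below are simplifying assumptions, not a real compiler pass:

```python
def fill_delay_slot(instrs):
    """Naive delayed-branch scheduling. Instructions are (op, regs)
    tuples; "BR" branches on the condition register regs[0]. If the
    preceding instruction does not touch that register, it is moved
    into the delay slot after the branch; otherwise a NOOP fills it."""
    out = []
    for op, regs in instrs:
        if op == "BR" and out and regs[0] not in out[-1][1]:
            prev = out.pop()           # safe: cannot alter the condition
            out += [(op, regs), prev]  # prev now occupies the delay slot
        elif op == "BR":
            out += [(op, regs), ("NOOP", ())]
        else:
            out.append((op, regs))
    return out

print(fill_delay_slot([("ADD", ("r1", "r2")), ("BR", ("r3",))]))
# [('BR', ('r3',)), ('ADD', ('r1', 'r2'))] -- ADD fills the delay slot
```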
6.10 Super-Pipelining
A super-pipelined architecture is one that makes use of more, finer-grained pipeline stages.
In super-pipelining, all instructions follow the same sequence of five pipeline stages, with
the clock cycle required for the activity in each stage divided into two equal phases. This
allows the activities needed for each stage to occur in parallel without using an entire
stage, meaning that the external instruction and data cache operations as well as the ALU
operations can each be broken up into two. In a super-pipelined system, existing hardware is
used several times per cycle by inserting pipeline registers to split up each pipe stage.
Each super-pipeline stage operates at a multiple of the base clock frequency, and this
multiple depends on the degree of super-pipelining (the number of phases into which each
stage is split).
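Under ideal assumptions, the effect of splitting stages can be estimated as follows: a degree-d super-pipeline has k*d sub-stages, each taking 1/d of a base clock cycle (no hazards are assumed, so this is an upper bound on the benefit):

```python
def superpipelined_time(k, n, degree):
    """Completion time, in base clock cycles, for n instructions in a
    k-stage pipeline whose stages are each split into `degree`
    sub-stages. Ideal case: no hazards, each sub-stage = 1/degree cycle."""
    substages = k * degree
    return (substages + (n - 1)) / degree

print(superpipelined_time(5, 9, 1))  # 13.0 base cycles, ordinary pipeline
print(superpipelined_time(5, 9, 2))  # 9.0 base cycles with a degree-2 split
```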
6.11 Instruction-Level Parallelism And Superscalar Processors
Superscalar refers to a machine that is designed to improve the performance of the
execution of scalar instructions. This is opposed to vector processors, which achieve
performance gains through parallel computation of elements of homogenous structures
such as vectors and arrays. The essence of the superscalar approach is the ability to execute
instructions independently in different pipelines and in an order different to the program
order. In general, there are multiple functional units, each of which is implemented as a
pipeline, which supports parallel execution of several instructions.

Reading/Activity: Please read pages 499-529 in your prescribed book to consolidate what
you already know up to this stage about Reduced Instruction Set Computing.

Assessment exercise
1 What are some typical distinguishing characteristics of RISC organization?
2 Briefly explain the two basic approaches used to minimize register-memory operations on RISC
machines
3 What are the typical characteristics of a RISC instruction set architecture?
4 What is a delayed branch?
References
Stallings, W. (2013), Computer Organization And Architecture, 8e. Pearson Education, NJ.
Englander, I. (2010), The Architecture Of Computer Hardware, Systems Software & Networking,
4e. John Wiley & Sons, Asia.
Digital Electronics Basics: Logic Gates And Boolean Algebra (2013). Available from:
http://www.ni.com/multisim/try/ [25 November 2013]
Dodge, N.B. (2012), Boolean Algebra And Combinational Data Logic. Available from:
http://www.utdallas.edu/~dodge/EE2310/lec4.pdf [November 2013]
Nguyen, T.H.L. (2009), Computer Architecture. Available from:
http://www.cnx.org/content/col10761/1.1/ [17 November 2013]
An Overview Of Cache. Available from:
http://www.download.intel.com/design/intarch/papers/cache6.pdf [28 November 2013]