Advanced Computer Architecture

Chapter 4
Processors and Memory Hierarchy
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani

Source diginotes.in Save the earth. Go paperless


In this chapter…

• Design Space of Processors


• Superscalar and Vector Processors
• Memory Hierarchy Technology

Dr. Vasanthakumar G U, Department of CSE, Cambridge Institute of Technology
ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Processor families mapped onto a coordinate space of clock rate versus CPI
o Clock rates have moved from lower to higher speeds
o CPI has been lowered over time

• Broad Categorization
o CISC
o RISC

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• The Design Space
o CISC Computers
o RISC Computers
o Superscalar Processors
o VLIW Processors
o Vector Supercomputers

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Instruction Pipelines
o Instruction Cycle Phases
o Pipeline and Pipeline Cycle
o Instruction Pipeline Cycle
o Instruction issue Latency
o Instruction Issue Rate (degree of superscalar processor)
o Simple Operation Latency
o Resource Conflicts
o Base scalar processor

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Vector Processors
o Memory-to-memory VP
• Memory-based instructions
• Longer instructions
• Instructions include memory addresses
o Register-to-register VP
• Shorter instructions
• Vector register files

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Vector Processors
o Vector Instructions (register-to-register)
• Binary vector: V1 ∘ V2 → V3
• Scaling: s1 ∘ V1 → V2
• Binary reduction: V1 ∘ V2 → s1
• Vector load: M(1:n) → V1
• Vector store: V1 → M(1:n)
• Unary vector: ∘ V1 → V2
• Unary reduction: ∘ V1 → s1

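The register-to-register vector instruction types listed above can be sketched in plain Python. This is a minimal illustration, not the book's notation made executable: vector registers are modelled as lists, and the operator ∘ is passed in as a function; all values are made up.

```python
# Register-to-register vector instruction types, sketched in plain Python.
# Vector registers are lists; "op" plays the role of the operator (∘).

def binary_vector(v1, v2, op):            # V1 ∘ V2 → V3
    return [op(a, b) for a, b in zip(v1, v2)]

def scaling(s1, v1, op):                  # s1 ∘ V1 → V2
    return [op(s1, a) for a in v1]

def binary_reduction(v1, v2, op):         # V1 ∘ V2 → s1 (e.g. dot product)
    total = 0
    for a, b in zip(v1, v2):
        total += op(a, b)
    return total

def unary_vector(v1, op):                 # ∘ V1 → V2
    return [op(a) for a in v1]

V1, V2 = [1, 2, 3], [4, 5, 6]
V3 = binary_vector(V1, V2, lambda a, b: a + b)       # elementwise add
s = binary_reduction(V1, V2, lambda a, b: a * b)     # dot product
```

Vector load and store are simply copies between a memory slice M(1:n) and a vector register, so they are omitted here.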
ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Vector Processors
o Vector Instructions (memory-to-memory)
• M1(1:n) ∘ M2(1:n) → M3(1:n)
• s1 ∘ M1(1:n) → M2(1:n)
• ∘ M1(1:n) → M2(1:n)
• M1(1:n) ∘ M2(1:n) → M(k)
o Vector Pipelines

• Symbolic Processors
o Prolog processors, Lisp processors, or symbolic manipulators.
o Deal with logic programs, symbolic lists, objects, scripts, production systems, semantic networks,
frames and artificial neural networks.

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Symbolic Processors
o Attributes & Characteristics of Symbolic Processing
• Knowledge Representation
o Lists, relational databases, semantic nets, frames etc.
• Common Operations
o Search, sort, pattern matching, filtering etc.
• Memory Requirements
o Large memory with intensive access pattern, content-based
• Communication Patterns
o Varying traffic size, granularity and format of messages

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Symbolic Processors
o Attributes & Characteristics of Symbolic Processing
• Properties of Algorithms
o Non-deterministic, parallel and distributed computations
• I/O Requirements
o User-guided programs, intelligent person-machine interfaces
• Architecture Features
o Parallel update of knowledge bases, dynamic load balancing, dynamic memory
allocation, hardware based garbage collection etc.

ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology

• Memory Hierarchy
- Need & Significance

ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology
• Memory Hierarchy
o Parameters
• Access time
• Memory size
• Cost per byte
• Transfer bandwidth
• Unit of transfer
o Properties
• Inclusion
• Coherence
• Locality

ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology
• Memory Hierarchy
o Parameters
• T(i): Access time (round-trip time from CPU to i-th level memory)
o T(i−1) < T(i) < T(i+1)
• S(i): Memory size (number of bytes or words in the i-th level memory)
o S(i−1) < S(i) < S(i+1)
• C(i): Cost per byte (per-byte cost of i-th level memory; total cost estimated by C(i)·S(i))
o C(i−1) > C(i) > C(i+1)
• B(i): Transfer bandwidth (rate at which information is transferred between adjacent levels)
o B(i−1) > B(i) > B(i+1)
• X(i): Unit of transfer (grain size for data transfer between levels i and i+1)
o X(i−1) < X(i) < X(i+1)

ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology
• Memory Hierarchy
o Properties
• Inclusion Property
o M1 ⊂ M2 ⊂ M3 ⊂ … ⊂ Mn
o M(i-1) is a subset of M(i)
• Coherence Property
o Copies of the same information item at successive memory levels must be kept consistent
o Strategies to maintain Coherence:
• 1) Write-Through (WT)
• 2) Write-Back (WB)
• Locality of Reference
o Temporal: recently referenced items are likely to be referenced again in near future
o Spatial: tendency of a process to access items whose addresses are near one another
(Figure: the working sets model.)
ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology
• Memory Capacity Planning
o Hit ratios
• h_i: hit ratio at the i-th level memory
o Access frequency to i-th level memory
• f_i = (1 − h_1)(1 − h_2) … (1 − h_{i−1}) · h_i
o Effective access time
• Teff = Σ_{i=1..n} f_i · t_i
o Total cost of memory hierarchy
• Ctotal = Σ_{i=1..n} c_i · s_i
o Hierarchy optimization: minimize Teff
• Subject to: s_i > 0, t_i > 0 (for i = 1 to n) and Ctotal = Σ_{i=1..n} c_i · s_i < C_0

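A worked example of the capacity-planning formulas above, as a small Python sketch. The hit ratios and access times are made-up illustrative numbers (cache, main memory, disk), not figures from the book; the last level is assumed to always hit.

```python
# Capacity-planning formulas: access frequencies f_i and effective access
# time Teff, for hit ratios h_i and per-level access times t_i.

def access_frequencies(h):
    # f_i = (1 - h_1)(1 - h_2) ... (1 - h_{i-1}) * h_i
    f, miss = [], 1.0
    for hi in h:
        f.append(miss * hi)
        miss *= (1.0 - hi)
    return f

def effective_access_time(h, t):
    # Teff = sum over levels of f_i * t_i
    return sum(fi * ti for fi, ti in zip(access_frequencies(h), t))

def total_cost(c, s):
    # Ctotal = sum over levels of c_i * s_i
    return sum(ci * si for ci, si in zip(c, s))

h = [0.95, 0.99, 1.0]      # hit ratios: cache, main memory, disk (always hits)
t = [4, 60, 10_000_000]    # access times in ns (illustrative)
Teff = effective_access_time(h, t)   # dominated by the rare disk accesses
```

Note how the tiny fraction of references reaching the slowest level dominates Teff, which is why each level's hit ratio matters so much.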
ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology
Virtual Memory Technology
• Virtual memory resides on secondary storage and extends the effective memory capacity.
• Works with primary memory to load applications.
• Reduces the cost of expanding physical memory capacity.
• Implementations differ from one OS to another.

Virtual Memory Technology

• Each process's address space is partitioned into parts used for code, data
and stack.
• Parts are loaded into primary memory when needed and written back to
secondary storage otherwise.
• The logical address space is referred to as virtual memory.
• Virtual memory is much larger than the physical memory.
• Virtual memory uses: Virtual address and Physical address.
• CPU translates Virtual address to Physical address.
• Virtual memory system uses paging.

Locating an Object in a Cache

1. Search for a matching tag (SRAM cache): the object's name is compared with the tag stored
alongside each cache line; a matching tag (e.g. “X”) selects the associated data.
2. Use indirection to look up the actual object location (DRAM “cache”, i.e. virtual memory): the
object's name indexes a lookup table recording where each object actually resides, and that
location is then used to access the data.
A System with Physical Memory Only

• Examples:
o most Cray machines, early PCs, nearly all embedded systems, etc.
• Addresses generated by the CPU point directly to bytes in physical memory.
(Figure: CPU issuing physical addresses 0 … N−1 directly to memory.)
A System with Virtual Memory

• Examples:
o workstations, servers, modern PCs, etc.
• Address Translation: the hardware converts virtual addresses into physical addresses via
an OS-managed lookup table (the page table).
(Figure: CPU virtual addresses translated through a page table to physical memory, with
non-resident pages kept on disk.)
Page Faults (Similar to “Cache Misses”)

• What if an object is on disk rather than in memory?
o The page table entry indicates that the virtual address is not in memory
o An OS exception handler is invoked, moving data from disk into memory
• the current process suspends; others can resume
• the OS has full control over the placement
(Figure: page-table state before and after the fault; the faulting entry changes from pointing
at disk to pointing at memory.)
Servicing a Page Fault

• (1) Initiate block read: the processor signals the I/O controller to read a block of length P
starting at disk address X and store it starting at memory address Y.
• (2) The read occurs by Direct Memory Access (DMA), under control of the I/O controller.
• (3) The I/O controller signals completion: it interrupts the processor, and the OS resumes the
suspended process.
(Figure: processor, cache, memory and disk controllers attached to the memory-I/O bus.)
Solution: Separate Virtual Addr. Spaces

o Virtual and physical address spaces divided into equal-sized blocks
• blocks are called “pages” (both virtual and physical)
o Each process has its own virtual address space
• the operating system controls how virtual pages are assigned to physical memory
(Figure: two processes' virtual address spaces mapped into one physical address space; a
physical page such as read-only library code in PP 7 can be shared by both processes.)
Protection

• The page table entry contains access-rights information
o hardware enforces this protection (trap into the OS if a violation occurs)

Example page tables:
Process i:  VP 0: Read yes, Write no,  PP 9
            VP 1: Read yes, Write yes, PP 4
            VP 2: no access
Process j:  VP 0: Read yes, Write yes, PP 6
            VP 1: Read yes, Write no,  PP 9
            VP 2: no access
Virtual Memory Address Translation

V = {0, 1, …, N−1} virtual address space (N > M)
P = {0, 1, …, M−1} physical address space

MAP: V → P ∪ {∅} address mapping function

MAP(a) = a′ if data at virtual address a is present at physical address a′ in P
MAP(a) = ∅ if data at virtual address a is not present in P (page fault)

(Figure: the processor presents virtual address a to the hardware address-translation
mechanism in the on-chip memory management unit (MMU); a hit yields physical address a′
for main memory, while a miss raises a page fault whose handler has the OS transfer the
page from secondary memory.)
Virtual Memory Address Translation

• Parameters
o P = 2^p = page size (bytes)
o N = 2^n = virtual address limit
o M = 2^m = physical address limit

A virtual address is split into a virtual page number (bits n−1 … p) and a page offset
(bits p−1 … 0); translation replaces the virtual page number with a physical page number
(bits m−1 … p). Notice that the page offset bits don't change as a result of translation.
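The virtual-to-physical translation described above can be sketched in a few lines of Python. The page size, the page-table contents, and the addresses are all illustrative assumptions; the point is the bit split and the pass-through of the offset.

```python
# Minimal address-translation sketch: split a virtual address into a virtual
# page number (VPN) and an offset, map VPN -> PPN via a page table, and
# reassemble. A missing/None entry models a non-resident page (page fault).

PAGE_BITS = 12                       # assume p = 12, i.e. 4 KiB pages
PAGE_SIZE = 1 << PAGE_BITS

page_table = {0: 9, 1: 4, 2: None}   # VPN -> PPN; None means "on disk"

def translate(vaddr):
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    ppn = page_table.get(vpn)
    if ppn is None:
        raise RuntimeError("page fault at VPN %d" % vpn)
    # The offset bits pass through translation unchanged.
    return (ppn << PAGE_BITS) | offset

paddr = translate(0x1234)            # VPN 1 -> PP 4, offset 0x234 preserved
```

In real hardware this lookup is done by the MMU (usually via a TLB first), and the page-fault case traps into the operating system instead of raising an exception.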
Page Tables

(Figure: a memory-resident page table indexed by virtual page number; each entry holds a
valid bit plus either a physical page number (valid = 1) or a disk address in the swap file
or a regular file-system file (valid = 0).)
Integrating VM and Cache

CPU → (VA) → translation → (PA) → cache; on a cache miss, the physical address goes to
main memory.

• Most Caches “Physically Addressed”
o Accessed by physical addresses
o Allows multiple processes to have blocks in cache at the same time
o Allows multiple processes to share pages
o The cache doesn't need to be concerned with protection issues
• access rights are checked as part of address translation
Speeding up Translation with a TLB

• “Translation Lookaside Buffer” (TLB)
o Small hardware cache in the MMU
o Maps virtual page numbers to physical page numbers
o Contains complete page table entries for a small number of pages

CPU → (VA) → TLB lookup; on a TLB hit the physical address goes straight to the cache,
while on a TLB miss the full translation is performed first and the cache is then accessed
with the resulting physical address.
Page Replacement

• Page replacement refers to the process in which a resident page in


main memory is replaced by a new page transferred from the disk.
• The goal of a page replacement policy is to minimize the number of
page faults.
• Reduce the effective memory access time.
• R(t): Resident set of all pages residing in the main memory at time t.
• Forward distance ft(x): The number of time slots from time t to the
first repeated reference of page x in the future.
• Backward distance bt(x): The number of time slots from time t to the
most recent reference of page x in the past.

Page Replacement Policies
o Least recently used (LRU): Replaces the page in R(t) which has the
longest backward distance.
o Optimal (OPT) algorithm: Replaces the page in R(t) with longest forward
distance.
o First-in-first-out (FIFO): Replaces the page in R(t) which has been in
memory for the longest time.
o Least frequently used (LFU): Replaces the page in R(t) which has been
least referenced in the past.
o Circular FIFO: Joins all the page frame entries into a circular FIFO
queue using a pointer to indicate the front of the queue.
o Random replacement: Trivial algorithm which chooses any page for
replacement randomly.
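Two of the policies above can be compared with a small simulation. This sketch counts page faults for FIFO and LRU on one reference string with three page frames; the reference string is a standard illustrative example, not one from the book.

```python
# Count page faults under FIFO and LRU replacement for a given reference
# string and a fixed number of page frames.

from collections import OrderedDict, deque

def count_faults_fifo(refs, frames):
    mem, queue, faults = set(), deque(), 0
    for p in refs:
        if p not in mem:
            faults += 1
            if len(mem) == frames:
                mem.discard(queue.popleft())   # evict oldest arrival
            mem.add(p)
            queue.append(p)
    return faults

def count_faults_lru(refs, frames):
    mem, faults = OrderedDict(), 0             # insertion order = recency order
    for p in refs:
        if p in mem:
            mem.move_to_end(p)                 # refresh: backward distance -> 0
        else:
            faults += 1
            if len(mem) == frames:
                mem.popitem(last=False)        # evict longest backward distance
            mem[p] = True
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
fifo_faults = count_faults_fifo(refs, 3)       # 9 faults
lru_faults = count_faults_lru(refs, 3)         # 10 faults
```

On this particular string FIFO happens to beat LRU; neither policy dominates on all workloads, which is why OPT (longest forward distance) serves only as an unrealizable lower bound.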
Advanced Computer Architecture

Chapter 6
Pipelining and Superscalar
Techniques
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani



In this chapter…

• Linear Pipeline Processors


• Non-linear Pipeline Processors
• Instruction Pipeline Design
• Arithmetic Pipeline Design
• Superscalar Pipeline Design

LINEAR PIPELINE PROCESSORS
• Linear Pipeline Processor
o Cascade of processing stages which are linearly connected to perform a fixed function over a stream of
data flowing from one end to the other.
o Instruction execution, arithmetic computations, and memory-access operations.
• Models of Linear Pipeline
o Synchronous Model
o Asynchronous Model
o (Corresponding reservation tables)
• Clocking and Timing Control
o Clock Cycle
o Pipeline Frequency
o Clock skewing
o Flow-through delay
o Speedup factor
o Optimal number of Stages and Performance-Cost Ratio (PCR)
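The speedup factor listed above follows from the timing of a k-stage linear pipeline: n tasks take (k + n − 1) cycles pipelined versus n·k cycles unpipelined, a standard result in Hwang's text. A minimal sketch (the sample k and n are illustrative):

```python
# k-stage linear pipeline timing: the first task takes k cycles to flow
# through; each subsequent task completes one cycle later.

def pipeline_time(k, n, tau):
    # Total time for n tasks, with pipeline cycle time tau.
    return (k + n - 1) * tau

def speedup(k, n):
    # Speedup over a non-pipelined unit that takes k*tau per task.
    return (n * k) / (k + n - 1)   # approaches k as n grows large

one_task = pipeline_time(4, 1, 1.0)   # flow-through delay: 4 cycles
S = speedup(4, 100)                   # 400/103, close to but below k = 4
```

The optimal number of stages then comes from trading this speedup against the latch overhead and cost added per stage (the performance/cost ratio noted above).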

NON-LINEAR PIPELINE PROCESSORS
• Dynamic Pipeline
o Configured to perform variable functions at different times.
o Allows streamline, feed-forward and feedback connections

• Static Pipeline
o Used to perform fixed functions

• Reservation and Latency Analysis


o Reservation tables
o Evaluation time
• Latency Analysis
o Latency
o Collision
o Forbidden latencies
o Latency Sequence, Latency Cycle and Average Latency

INSTRUCTION PIPELINE DESIGN

• Instruction Execution Phases


o E.g. Fetch, Decode, Issue, Execute, Write-back
o In-order Instruction issuing and Reordered Instruction issuing
• E.g. X = Y + Z , A = B x C
• Mechanisms/Design Issues for Instruction Pipelining
o Pre-fetch Buffers
o Multiple Functional Units
o Internal Data Forwarding
o Hazard Avoidance
• Dynamic Scheduling
• Branch Handling Techniques

INSTRUCTION PIPELINE DESIGN

• Fetch: fetches instructions from memory; ideally one per cycle


• Decode: reveals instruction operations to be performed and identifies the resources needed
• Issue: reserves the resources and reads the operands from registers
• Execute: actual processing of operations as indicated by instruction
• Write Back: writing results into the registers

INSTRUCTION PIPELINE DESIGN
Mechanisms/Design Issues of Instruction Pipeline
• Pre-fetch Buffers
o Sequential Buffers
o Target Buffers
o Loop Buffers

INSTRUCTION PIPELINE DESIGN
Mechanisms/Design Issues of Instruction Pipeline
• Multiple Functional Units
o Reservation Station and Tags
o Slow-station as Bottleneck stage
• Subdivision of Pipeline Bottleneck stage
• Replication of Pipeline Bottleneck stage

INSTRUCTION PIPELINE DESIGN
Mechanisms/Design Issues of Instruction Pipeline
• Internal Forwarding and Register Tagging
o Internal Forwarding:
• A “short-circuit” technique to replace unnecessary memory accesses by register-register
transfers in a sequence of fetch-arithmetic-store operations
o Register Tagging:
• Use of tagged registers, buffers and reservation stations for exploiting concurrent activities
among multiple arithmetic units
o Store-Fetch Forwarding
• (M ← R1, R2 ← M) replaced by (M ← R1, R2 ← R1)
o Fetch-Fetch Forwarding
• (R1 ← M, R2 ← M) replaced by (R1 ← M, R2 ← R1)
o Store-Store Overwriting
• (M ← R1, M ← R2) replaced by (M ← R2)

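The three forwarding/overwriting transformations above can be expressed as a peephole rewrite over adjacent move operations. This sketch models each move as a (destination, source) pair and uses a single memory location "M", a deliberate simplification for illustration only.

```python
# Peephole sketch of internal forwarding over two adjacent moves (dst <- src).
# "M" stands for one memory location; "R1", "R2", ... are registers.

def forward(pair1, pair2):
    (d1, s1), (d2, s2) = pair1, pair2
    if d1 == "M" and s2 == "M":
        # Store-fetch: (M <- R1, R2 <- M)  =>  (M <- R1, R2 <- R1)
        return [(d1, s1), (d2, s1)]
    if s1 == "M" and s2 == "M":
        # Fetch-fetch: (R1 <- M, R2 <- M)  =>  (R1 <- M, R2 <- R1)
        return [(d1, s1), (d2, d1)]
    if d1 == "M" and d2 == "M":
        # Store-store: (M <- R1, M <- R2)  =>  (M <- R2); first store is dead
        return [(d2, s2)]
    return [pair1, pair2]                      # no transformation applies

out = forward(("M", "R1"), ("R2", "M"))        # store-fetch forwarding
```

Each rewrite removes a memory access in favour of a register-register transfer, which is exactly the "short-circuit" the slide describes; a real implementation would of course track distinct memory addresses and intervening writes.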
INSTRUCTION PIPELINE DESIGN
Mechanisms/Design Issues of Instruction Pipeline
• Hazard Detection and Avoidance
o Domain or Input Set of an instruction
o Range or Output Set of an instruction
o Data Hazards: RAW, WAR and WAW
o Resolution using Register Renaming approach
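The domain/range definitions above give a direct way to classify hazards between an instruction I and a later instruction J. This sketch represents domains and ranges as sets of register names (the registers in the example are illustrative):

```python
# Classify data hazards between instruction I (earlier) and J (later),
# given their domains (input sets) and ranges (output sets).

def hazards(dom_i, rng_i, dom_j, rng_j):
    found = []
    if rng_i & dom_j:
        found.append("RAW")   # read-after-write: J reads what I writes
    if dom_i & rng_j:
        found.append("WAR")   # write-after-read: J writes what I reads
    if rng_i & rng_j:
        found.append("WAW")   # write-after-write: both write the same location
    return found

# I: R1 = R2 + R3    J: R4 = R1 * R5   -> RAW hazard on R1
hz = hazards({"R2", "R3"}, {"R1"}, {"R1", "R5"}, {"R4"})
```

WAR and WAW are name dependences only, which is why register renaming (giving J's write a fresh physical register) resolves them while a true RAW dependence must be honoured.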

INSTRUCTION PIPELINE DESIGN
Dynamic Instruction Scheduling
• Idea of Static Scheduling
o Compiler-based scheduling strategy to resolve interlocking among instructions

• Dynamic Scheduling
o Tomasulo’s Algorithm (Register-Tagging Scheme)
• Hardware based dependence-resolution
o Scoreboarding Technique
• Scoreboard: the centralized control unit
• A kind of data-driven mechanism

INSTRUCTION PIPELINE DESIGN
Branch Handling Techniques
• Branch Taken, Branch Target, Delay Slot
• Effect of Branching
o Parameters:
• k: No. of stages in the pipeline
• n: Total no. of instructions or tasks
• p: Percentage of branch instructions over n
• q: Percentage of successful branch instructions (branch taken) over p.
• b: Delay Slot
• τ: Pipeline Cycle Time
o Branch Penalty = q of (p of n) * bτ = pqnbτ
o Effective Execution Time:
• Teff = [k + (n-1)] τ + pqnbτ = [k + (n-1) + pqnb]τ
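The effective execution time formula above can be evaluated directly. The parameter values below are illustrative (a 5-stage pipeline with the worst-case delay slot b = k − 1), not numbers from the book:

```python
# Effective execution time with branching:
#   Teff = [k + (n - 1) + p*q*n*b] * tau

def effective_time(k, n, p, q, b, tau):
    return (k + (n - 1) + p * q * n * b) * tau

k, n, tau = 5, 1000, 1.0     # 5 stages, 1000 instructions, 1 ns cycle
p, q, b = 0.2, 0.6, k - 1    # 20% branches, 60% taken, delay slot b = k-1
T = effective_time(k, n, p, q, b, tau)

penalty = p * q * n * b * tau          # the pqnb*tau branch-penalty term
baseline = (k + (n - 1)) * tau         # time with no branching at all
```

Here the branch penalty (480 cycles) adds nearly 50% to the baseline 1004 cycles, which motivates the prediction and delayed-branch techniques on the following slides.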

INSTRUCTION PIPELINE DESIGN
Branch Handling Techniques
• Effect of Branching
o Effective Throughput:
• Heff = n/Teff
• Heff = n / {[k + (n-1) + pqnb]τ} = nf / [k + (n-1) + pqnb]
• As n → ∞ and with b = k − 1:
o H*eff = f / [pq(k−1) + 1]
• If p = 0 and q = 0 (no branching occurs):
o H**eff = f = 1/τ
o Performance Degradation Factor
• D = 1 − H*eff / f = pq(k−1) / [pq(k−1) + 1]

27
INSTRUCTION PIPELINE DESIGN
Branch Handling Techniques
• Branch Prediction
o Static Branch Prediction: based on branch code types
o Dynamic Branch prediction: based on recent branch history
• Strategy 1: Predict the branch direction based on information found at decode stage.
• Strategy 2: Use a cache to store target addresses at effective address calculation stage.
• Strategy 3: Use a cache to store target instructions at fetch stage
o Branch Target Buffer organization

• Delayed Branches
o A delayed branch of d cycles allows at most d-1 useful instructions to be executed following the
branch taken.
o Execution of these instructions should be independent of branch instruction to achieve a zero
branch penalty

ARITHMETIC PIPELINE DESIGN
Computer Arithmetic Operations
• Finite-precision arithmetic
• Overflow and Underflow
• Fixed-Point operations
o Notations:
• Signed-magnitude, one’s complement and two’s complement notation
o Operations:
• Addition: (n bits, n bits) → (n-bit) sum, 1-bit output carry
• Subtraction: (n bits, n bits) → (n-bit) difference
• Multiplication: (n bits, n bits) → (2n-bit) product
• Division: (2n bits, n bits) → (n-bit) quotient, (n-bit) remainder

ARITHMETIC PIPELINE DESIGN
Computer Arithmetic Operations
• Floating-Point Numbers
o X = (m, e) representation
• m: mantissa or fraction
• e: exponent with an implied base or radix r
• Actual value: X = m · r^e
o Operations on numbers X = (mx, ex) and Y = (my, ey)
• Addition: (mx · r^(ex−ey) + my, ey)
• Subtraction: (mx · r^(ex−ey) − my, ey)
• Multiplication: (mx · my, ex + ey)
• Division: (mx / my, ex − ey)
• Elementary Functions
o Transcendental functions like: Trigonometric, Exponential, Logarithmic, etc.
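The four (m, e) operations above can be written as a toy floating-point model. This sketch assumes radix r = 10 and, like the slide's formulas, performs no normalization or rounding; the sample values are illustrative:

```python
# Toy (mantissa, exponent) floating-point arithmetic with radix R = 10,
# following the slide's formulas (no normalization or rounding).

R = 10

def fadd(x, y):
    (mx, ex), (my, ey) = x, y
    # Align x's mantissa to y's exponent, then add mantissas.
    return (mx * R ** (ex - ey) + my, ey)

def fsub(x, y):
    (mx, ex), (my, ey) = x, y
    return (mx * R ** (ex - ey) - my, ey)

def fmul(x, y):
    (mx, ex), (my, ey) = x, y
    return (mx * my, ex + ey)       # multiply mantissas, add exponents

def fdiv(x, y):
    (mx, ex), (my, ey) = x, y
    return (mx / my, ex - ey)       # divide mantissas, subtract exponents

def value(x):
    m, e = x
    return m * R ** e

X, Y = (3, 2), (5, 1)               # X = 300, Y = 50
s = fadd(X, Y)                      # (35, 1), i.e. 350
```

Real hardware additionally normalizes the mantissa, rounds, and handles overflow/underflow of the exponent, which is where most of the pipeline stages in a floating-point adder go.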

ARITHMETIC PIPELINE DESIGN
Static Arithmetic Pipelines
• Separate units for fixed point operations and floating point operations
• Scalar and Vector Arithmetic Pipelines
• Uni-functional or Static Pipelines
• Arithmetic Pipeline Stages
o Majorly involve hardware to perform: Add and Shift micro-operations
o Addition using: Carry Propagation Adder (CPA) and Carry Save Adder (CSA)
o Shift using: Shift Registers

• Multiplication Pipeline Design


o E.g. multiplying two 8-bit numbers to yield a 16-bit product, using a Wallace tree of carry-save adders (CSAs) with a final carry-propagate adder (CPA).

ARITHMETIC PIPELINE DESIGN
Multifunctional Arithmetic Pipelines
• Multifunctional Pipeline:
o Static multifunctional pipeline
o Dynamic multifunctional pipeline

• Case Study: the TI-ASC static multifunctional pipeline architecture

SUPERSCALAR PIPELINE DESIGN
• Pipeline Design Parameters
o Pipeline cycle, Base cycle, Instruction issue rate, Instruction issue Latency, Simple Operation Latency
o ILP to fully utilize the pipeline
• Superscalar Pipeline Structure
• Data and Resource Dependencies
• Pipeline Stalling
• Superscalar Pipeline Scheduling
o In-order Issue and in-order completion
o In-order Issue and out-of-order completion
o Out-of-order Issue and out-of-order completion
• Superscalar Performance

SUPERSCALAR PIPELINE DESIGN

Parameter                       | Base Scalar Processor | Superscalar Processor (degree = K)
Pipeline cycle                  | 1 (base cycle)        | 1
Instruction issue rate          | 1                     | K
Instruction issue latency       | 1                     | 1
Simple operation latency        | 1                     | 1
ILP to fully utilize pipeline   | 1                     | K

SUPERSCALAR PIPELINE DESIGN

• Time required by the base scalar machine:
o T(1,1) = k + N − 1
• The ideal execution time required by an m-issue superscalar machine:
o T(m,1) = k + (N − m)/m
o Where:
• k is the time required to execute the first m instructions through the m pipelines of k stages
simultaneously
• the second term corresponds to the time required to execute the remaining N − m instructions,
m per cycle, through m pipelines
• The ideal speedup of the superscalar machine:
o S(m,1) = T(1,1)/T(m,1) = m(N + k − 1)/[N + m(k − 1)]
• As N → ∞, S(m,1) → m
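The timing and speedup formulas above, evaluated for one set of illustrative parameters (k stages, N instructions, issue width m):

```python
# Superscalar timing: base scalar time T(1,1), m-issue time T(m,1), and the
# ideal speedup S(m,1); all times in pipeline cycles.

def t_scalar(k, N):
    return k + N - 1                 # T(1,1)

def t_superscalar(k, N, m):
    return k + (N - m) / m           # T(m,1)

def ideal_speedup(k, N, m):
    return t_scalar(k, N) / t_superscalar(k, N, m)

k, N, m = 4, 1200, 3                 # illustrative values
S = ideal_speedup(k, N, m)           # just below the asymptotic limit m = 3
```

Even with N in the thousands, S stays strictly below m; the limit m is reached only as N grows without bound, and real dependences and resource conflicts push achieved speedup lower still.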