Simulation, Modeling and Analysis of

Computer Networks
(ECE 6620)

Types of Workloads
(Chapter#4)

Art of Computer Systems Performance Analysis


By R. Jain

Dr. M. Hasan Islam

2010 Raj Jain www.rajjain.com


Overview

Terminology
Test Workloads for Computer Systems
Addition Instruction
Instruction Mixes
Kernels
Synthetic Programs
Application Benchmarks: Sieve, Ackermann's Function, Debit-Credit, SPEC


Workload Selection

Computer system performance measurements involve monitoring the system while it is being subjected to a particular workload.
To perform meaningful measurements, the workload should be carefully selected.
To achieve that goal, the performance analyst needs to answer the following questions before performing measurements:
1. What are the different types of workloads?
2. Which workloads are commonly used by other analysts?
3. How are the appropriate workload types selected?
4. How is the measured workload data summarized?
5. How is the system performance monitored?
6. How can the desired workload be placed on the system in a controlled manner?
7. How are the results of the evaluation presented?

Terminology

Test workload

Any workload used in performance studies

Test workload can be real or synthetic

Real workload

Observed on a system being used for normal operations

Cannot be repeated, generally not suitable for use as a test workload

Synthetic workload:

Similar to real workload

Can be applied repeatedly in a controlled manner

No large real-world data files

No sensitive data

Easily modified without affecting operation

Easily ported to different systems due to its small size

May have built-in measurement capabilities



Test Workloads for Computer Systems


1. Addition Instruction
2. Instruction Mixes
3. Kernels
4. Synthetic Programs
5. Application Benchmarks


Addition Instruction
In the early days, the processor was the most expensive and most heavily used component of the system, and addition was the most frequent instruction.
Thus, as a first approximation, the computer with the faster addition instruction was considered to be the better performer.
The addition instruction was the sole workload used, and the addition time was the sole performance metric.


Instruction Mixes

An instruction mix is a specification of various instructions coupled with their usage frequencies.
Gibson mix: developed by Jack C. Gibson in 1959 for IBM 704 systems.


Instruction Mixes (Cont)

Disadvantages:
Complex classes of instructions are not reflected in the mixes.
Instruction time varies with:
Addressing modes
Cache hit rates
Pipeline efficiency
Interference from other devices during processor-memory access cycles
Parameter values, for example:
Frequency of zeros as a parameter
Distribution of zero digits in a multiplier
Average number of positions of pre-shift in floating-point add
Number of times a conditional branch is taken

Instruction Mixes (Cont)

Performance Metrics:

MIPS = Millions of Instructions Per Second

MFLOPS = Millions of Floating Point Operations Per Second

It must be pointed out that instruction mixes measure only the speed of the processor.
This may or may not affect the total system performance when the system consists of many other components.
System performance is limited by the performance of the bottleneck component; unless the processor is the bottleneck (that is, the workload is mostly compute bound), the MIPS rate of the processor does not reflect the system performance.
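As an illustration of how a mix yields a MIPS figure, the following sketch computes a weighted-average instruction time. The instruction classes, frequencies, and per-instruction times are made-up illustrative values, not the actual Gibson mix.

```python
# Weighted-average instruction time from a hypothetical instruction mix.
# Frequencies and times below are illustrative only, not the Gibson mix.
mix = {
    # instruction class: (usage frequency, time in microseconds)
    "load/store": (0.31, 0.4),
    "fixed add/sub": (0.26, 0.3),
    "branch": (0.17, 0.2),
    "float add/sub": (0.07, 0.9),
    "float multiply": (0.04, 1.6),
    "other": (0.15, 0.5),
}

# Average time per instruction = sum of frequency * time over all classes
avg_time_us = sum(f * t for f, t in mix.values())
mips = 1.0 / avg_time_us  # 1 / (microseconds per instruction) = MIPS
print(round(avg_time_us, 3), round(mips, 2))
```

Note that the frequencies must sum to 1; a real mix would be derived from measured instruction traces of the target workload.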

Kernels

The introduction of pipelining, instruction caching, and various address translation mechanisms made computer instruction times highly variable; an individual instruction could no longer be considered in isolation.
Instead, it became more appropriate to consider a set of instructions that constitutes a higher-level function, a service provided by the processor.
Such a function is called a kernel (a most frequently used function or algorithm).
Most of the initial kernels did not make use of input/output (I/O) devices and concentrated solely on processor performance; this class could be called processing kernels.
Commonly used kernels: Sieve, Puzzle, Tree Searching, Ackermann's Function, Matrix Inversion, and Sorting.
Disadvantages: kernels do not make use of I/O devices or OS services, and thus kernel performance does not reflect total system performance.


Synthetic Programs

The need to measure I/O performance led analysts to develop simple exerciser loops that make a specified number of service calls or I/O requests.
Such loops allow the analyst to compute the average CPU time and elapsed time for each service call.
Exerciser loops are also used to measure operating system services such as process creation, forking, and memory allocation.
To remain portable across operating systems, such exercisers are usually written in high-level languages such as FORTRAN or Pascal.
The first exerciser loop was by Buchholz (1969), who called it a synthetic program.
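A minimal Python sketch of such an exerciser loop (the originals were written in FORTRAN or Pascal; the two service calls exercised here are arbitrary examples, not ones from the text):

```python
import os
import time

def exercise(call, n_calls=10_000):
    """Invoke a service call n_calls times and return the average
    elapsed time per call, in microseconds."""
    start = time.perf_counter()
    for _ in range(n_calls):
        call()
    elapsed = time.perf_counter() - start
    return elapsed / n_calls * 1e6

# Exercise two cheap OS services; the absolute numbers are machine
# dependent, so no expected output is shown.
print(f"getpid: {exercise(os.getpid):8.3f} us/call")
print(f"stat:   {exercise(lambda: os.stat('.')):8.3f} us/call")
```

A real exerciser would also record CPU time separately from elapsed time, as the slide describes.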

Synthetic Programs

Advantage:

Quickly developed and given to different vendors


No real data files
Easily modified and ported to different systems
Have built-in measurement capabilities
Measurement process is automated
Repeated easily on successive versions of the operating systems

Disadvantages:

Too small
Do not make representative memory or disk references
Mechanisms for page faults and disk cache may not be adequately
exercised
CPU-I/O overlap may not be representative
Loops may create synchronizations, which may result in better or worse performance than a real workload

Synthetic Workload Generation Program

[Figure: listing of a synthetic workload generation program; not reproduced in this text version]

Application Benchmarks

If the computer systems to be compared are to be used for a particular application (for example, banking or airline reservations), a representative subset of functions for that application may be used.
Such benchmarks are generally described in terms of the functions to be performed and make use of almost all resources in the system, including processors, I/O devices, networks, and databases.
Benchmarking
The process of performance comparison of two or more systems by measurements.
The workloads used in the measurements are called benchmarks.
Some authors: benchmark = set of programs taken from real workloads.
Popular Benchmarks
Sieve, Ackermann's Function, Whetstone, LINPACK, Dhrystone, Lawrence Livermore Loops, Debit-Credit Benchmark, SPEC Benchmark Suite

Sieve
The sieve kernel has been used to compare microprocessors,
personal computers, and high-level languages
Based on Eratosthenes' sieve algorithm: find all prime numbers below a given number n.
Algorithm:
Write down all integers from 1 to n.
Strike out all multiples of k, for k = 2, 3, ..., sqrt(n).
Example:
Write down all numbers from 1 to 20. Mark all as potential primes:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Strike out all multiples of 2; the numbers remaining are:
1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19


Sieve (Cont)
The next remaining integer in the sequence is 3. Strike out all multiples of 3; the numbers remaining are:
1, 2, 3, 5, 7, 11, 13, 17, 19
The next remaining integer is 5, and 5 > sqrt(20), so stop: all remaining numbers are prime.
Pascal Program to Implement the Sieve Kernel:
See Program listing Figure 4.2 in the book
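The book's Figure 4.2 gives a Pascal listing; an equivalent Python sketch follows (it treats 1 as non-prime, per the usual definition):

```python
def sieve(n):
    """Return all prime numbers up to and including n using the
    Sieve of Eratosthenes."""
    is_prime = [True] * (n + 1)
    is_prime[0:2] = [False, False]          # 0 and 1 are not prime
    k = 2
    while k * k <= n:                       # only need k up to sqrt(n)
        if is_prime[k]:
            for multiple in range(k * k, n + 1, k):
                is_prime[multiple] = False  # strike out multiples of k
        k += 1
    return [i for i, p in enumerate(is_prime) if p]

print(sieve(20))  # → [2, 3, 5, 7, 11, 13, 17, 19]
```

As a benchmark kernel, the loop would be repeated many times over a fixed n and the total elapsed time measured.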


Ackermann's Function

Used to assess the efficiency of the procedure-calling mechanism.
The function has two parameters and is defined recursively.
Ackermann(3, n) is evaluated for values of n from one to six.
Metrics:
Average execution time per call,
Number of instructions executed per call, and
Stack space per call.
Verification: Ackermann(3, n) = 2^(n+3) - 3
Number of recursive calls in evaluating Ackermann(3, n): (512 * 4^(n-1) - 15 * 2^(n+3) + 9n + 37) / 3
This expression is used to compute the average execution time per call.
Depth of the procedure calls = 2^(n+3) - 4, so the stack space required doubles when n is increased by 1.
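A direct recursive sketch in Python, including a check of the verification formula Ackermann(3, n) = 2^(n+3) - 3:

```python
import sys

def ackermann(m, n):
    """Classic two-parameter Ackermann function, defined recursively."""
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

# The call depth for Ackermann(3, n) is 2^(n+3) - 4, so raise the
# recursion limit before evaluating larger n.
sys.setrecursionlimit(100_000)

# Verify Ackermann(3, n) = 2^(n+3) - 3 for n = 1..6
for n in range(1, 7):
    assert ackermann(3, n) == 2 ** (n + 3) - 3
print([ackermann(3, n) for n in range(1, 4)])  # → [13, 29, 61]
```

Timing this loop and dividing by the recursive-call count from the formula above yields the average execution time per call.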

Other Benchmarks
Whetstone
U.S. Steel
LINPACK
Dhrystone
Doduc
TOP
Lawrence Livermore Loops
Digital Review Labs
Abingdon Cross Image-Processing Benchmark


Debit-Credit Benchmark
A de facto standard for transaction processing systems.
First recorded in Anon et al. (1985).
In 1973, a retail bank wanted to put its 1,000 branches, 10,000 tellers, and 10,000,000 accounts online with a peak load of 100 transactions per second (TPS).
Each TPS of capacity thus requires 10 branches, 100 tellers, and 100,000 accounts.
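The scaling rule can be expressed directly (a trivial sketch, reproducing the slide's numbers):

```python
def debit_credit_config(tps):
    """Size of the bank database required for a given peak TPS rating,
    using the benchmark's scaling rule: each TPS requires 10 branches,
    100 tellers, and 100,000 accounts."""
    return {
        "branches": 10 * tps,
        "tellers": 100 * tps,
        "accounts": 100_000 * tps,
    }

print(debit_credit_config(100))
# → {'branches': 1000, 'tellers': 10000, 'accounts': 10000000}
```

This scaling rule keeps the database size proportional to the claimed throughput, preventing a small, fully cached database from inflating the TPS figure.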



Debit-Credit Benchmark (Continued)

Metric: price/performance ratio.
Performance
Throughput in terms of TPS such that 95% of all transactions provide one second or less response time.
Response time is measured as the time interval between the arrival of the last bit from the communications line and the sending of the first bit to the communications line.
Cost
Total expenses for a five-year period on purchase, installation, and maintenance of the hardware and software in the machine room.
Cost does not include expenditures for terminals, communications, application development, or operations.

Debit-Credit Transaction Pseudo-Code

[Pseudo-code listing not reproduced in this text version]

Pseudo-code Definition of Debit-Credit

Four record types: account, teller, branch, and history.
Fifteen percent of the transactions require remote access.
The Transaction Processing Performance Council (TPC) was formed in August 1988.
TPC Benchmark™ A is a variant of the debit-credit benchmark.
Metric: TPS such that 90% of all transactions provide two seconds or less response time.
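Using the four record types above, the shape of the transaction can be sketched as follows. This is an illustrative sketch, not the book's pseudo-code: plain dicts stand in for the database, and commit/abort logic and remote access are omitted.

```python
# Illustrative debit-credit transaction over the four record types
# (account, teller, branch, history). A real benchmark implementation
# would use transactional storage with commit/abort semantics.
accounts = {1: {"balance": 500}}
tellers = {1: {"balance": 0}}
branches = {1: {"balance": 0}}
history = []

def debit_credit(account_id, teller_id, branch_id, delta):
    """Apply one transaction: update the account, teller, and branch
    balances by delta and append a history record."""
    accounts[account_id]["balance"] += delta
    tellers[teller_id]["balance"] += delta
    branches[branch_id]["balance"] += delta
    history.append((account_id, teller_id, branch_id, delta))
    return accounts[account_id]["balance"]

print(debit_credit(1, 1, 1, +100))  # → 600
```

Each transaction touches all four record types, which is what forces the benchmark to exercise the I/O and database subsystems rather than the CPU alone.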


SPEC Benchmark Suite

Systems Performance Evaluation Cooperative (SPEC): a nonprofit corporation formed by leading computer vendors to develop a standardized set of benchmarks.
Release 1.0 consists of 10 benchmarks: GCC, Espresso, Spice 2g6, Doduc, NASA7, LI, Eqntott, Matrix300, Fpppp, and Tomcatv.
These primarily stress the CPU, the Floating-Point Unit (FPU), and to some extent the memory subsystem, and are used to compare CPU speeds.
Benchmarks to compare I/O and other subsystems may be included in future releases.


SPEC Benchmark Suite


1. GCC
The time for the GNU C Compiler to convert 19 preprocessed source files into assembly language
output is measured
This benchmark is representative of a software engineering environment and measures the
compiling efficiency of a system
2. Espresso
An Electronic Design Automation (EDA) tool that performs heuristic boolean function
minimization for Programmable Logic Arrays (PLAs)
The elapsed time to run a set of seven input models is measured.
3. Spice 2g6
Spice, another representative of the EDA environment, is a widely used analog circuit simulation
tool
The time to simulate a bipolar circuit is measured.
4. Doduc
This is a synthetic benchmark that performs a Monte Carlo simulation of certain aspects of a
nuclear reactor. Because of its iterative structure and abundance of short branches and compact
loops, it tests the cache memory effectiveness.
5. NASA7
This is a collection of seven floating-point intensive kernels performing matrix operations on
double-precision data.


SPEC Benchmark Suite

6. LI
The elapsed time to solve the popular 9-queens problem with a LISP interpreter is measured.
7. Eqntott
Translates a logical representation of a Boolean equation to a truth table.
8. Matrix300
Performs various matrix operations using several LINPACK routines on matrices of size 300 × 300.
The code uses double-precision floating-point arithmetic and is highly vectorizable.
9. Fpppp
This is a quantum chemistry benchmark that performs two electron integral
derivatives using double-precision floating-point FORTRAN. It is difficult to
vectorize.
10. Tomcatv
A vectorized mesh generation program using double-precision floating-point
FORTRAN
Since it is highly vectorizable, substantial speedups have been observed on
several shared-memory multiprocessor systems

SPEC (Cont)

The elapsed time to run two copies of a benchmark on each of the N processors of a system (a total of 2N copies) is measured and compared with the time to run two copies of the benchmark on a reference system (a VAX-11/780 for Release 1.0).
For each benchmark, the ratio of the time on the reference system to the time on the system under test is reported as the SPECthruput, using the notation #CPU@Ratio. For example, a system with three CPUs taking 1/15 as long as the reference system on the GCC benchmark has a SPECthruput of 3@15.
This is a measure of the per-processor throughput relative to the reference system.

SPEC (Cont)

The aggregate throughput for all processors of a multiprocessor system can be obtained by multiplying the ratio by the number of processors. For example, the aggregate throughput for the above system is 3 × 15 = 45.
The geometric mean of the SPECthruputs for the 10 benchmarks indicates the overall performance for the suite and is called the SPECmark.
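The two computations can be sketched as follows. The per-benchmark ratios used here are made-up illustrative values, not measured SPEC results:

```python
import math

def specthruput_aggregate(n_cpus, ratio):
    """Aggregate multiprocessor throughput: ratio times number of CPUs."""
    return n_cpus * ratio

def specmark(ratios):
    """Geometric mean of the per-benchmark SPECthruput ratios."""
    return math.prod(ratios) ** (1 / len(ratios))

# The 3@15 example from the text: 3 CPUs, ratio 15
print(specthruput_aggregate(3, 15))  # → 45

# Hypothetical ratios for the 10 benchmarks (illustrative values only)
ratios = [15, 12, 9, 11, 14, 10, 8, 13, 9, 12]
print(round(specmark(ratios), 2))
```

The geometric mean is used rather than the arithmetic mean so that the overall figure is a mean of speed ratios, which keeps the ranking of systems independent of the choice of reference machine.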

