Cluster

1
Rajkumar Buyya, Monash University, Melbourne.

Email: rajkumar@csse.monash.edu.au / rajkumar@buyya.com
Web: http://www.ccse.monash.edu.au/~rajkumar / www.buyya.com
High Performance Cluster Computing
(Architecture, Systems, and Applications)

ISCA
2000
2
Objectives

] Learn and Share Recent advances in cluster
computing (both in research and commercial
settings):
Architecture,
System Software
Programming Environments and Tools
Applications
] Cluster Computing Infoware: (tutorial online)
http://www.buyya.com/cluster/
3
Agenda
Overview of Computing
Motivations & Enabling Technologies
Cluster Architecture & its Components
Clusters Classifications
Cluster Middleware
Single System Image
Representative Cluster Systems
Resources and Conclusions

4
P P P P P P
Microkernel
Multi-Processor Computing System
Threads Interface
Hardware
Operating System
Process Processor Thread
P
Applications
Computing Elements
Programming Paradigms
5

Architectures
System Software
Applications
P.S.Es
Architectures
System Software
Applications
P.S.Es
Sequential
Era
Parallel
Era
1940 50 60 70 80 90 2000 2030
Two Eras of Computing
Commercialization
R & D Commodity
6
Computing Power and
Computer Architectures
7
Computing Power (HPC) Drivers
Solving grand challenge applications using
computer modeling, simulation and analysis
Life Sciences
CAD/CAM
Aerospace
Military Applications
Digital Biology
E-commerce/anything
8
How to Run App. Faster ?
] There are 3 ways to improve performance:
1. Work Harder
2. Work Smarter
3. Get Help
] Computer Analogy
1. Use faster hardware: e.g. reduce the time per
instruction (clock cycle).
2. Optimized algorithms and techniques
3. Multiple computers to solve problem: That
is, increase no. of instructions executed per
clock cycle.
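The three options above can be put in numbers with the usual relation: run time = instructions x cycles per instruction x cycle time. The sketch below is illustrative only; the function name and every figure in it are assumptions, and it assumes work divides perfectly across processors.

```python
def run_time(instructions, cycles_per_instruction, cycle_time_ns, processors=1):
    """Idealised run time in seconds (assumes a perfect split across processors)."""
    total_cycles = instructions * cycles_per_instruction / processors
    return total_cycles * cycle_time_ns * 1e-9

# All figures below are made-up illustration values.
baseline = run_time(1e9, 2.0, 10.0)               # reference machine
harder = run_time(1e9, 2.0, 5.0)                  # 1. faster clock (shorter cycle)
smarter = run_time(5e8, 2.0, 10.0)                # 2. better algorithm, half the instructions
helped = run_time(1e9, 2.0, 10.0, processors=4)   # 3. four computers share the work

for name, t in [("baseline", baseline), ("harder", harder),
                ("smarter", smarter), ("helped", helped)]:
    print(f"{name}: {t:.1f} s")
```

Real programs rarely split perfectly, so the "get help" figure is an upper bound on the gain.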
9
10
Application Case Study

Web Serving and E-Commerce
11
E-Commerce and PDC ?
] What are/will be the major problems/issues in
eCommerce? How will or can PDC be applied to
solve some of them?
] Other than Compute Power, what else can PDC
contribute to e-commerce?
] How would/could the different forms of PDC
(clusters, hypercluster, GRID,) be applied to e-
commerce?
] Could you describe one hot research topic for
PDC applying to e-commerce?
] A killer e-commerce application for PDC ?
] ...
12
Killer Applications of Clusters
] Numerous Scientific & Engineering Apps.
] Parametric Simulations
] Business Applications
E-commerce Applications (Amazon.com, eBay.com .)
Database Applications (Oracle on cluster)
Decision Support Systems
] Internet Applications
Web serving / searching
Infowares (yahoo.com, AOL.com)
ASPs (application service providers)
eMail, eChat, ePhone, eBook, eCommerce, eBank, eSociety, eAnything!
Computing Portals
] Mission Critical Applications
command control systems, banks, nuclear reactor control, star-war, and
handling life threatening situations.
13
Major problems/issues in E-
commerce
Social Issues
Capacity Planning
] Multilevel Business Support (e.g., B2P2C)
] Information Storage, Retrieval, and Update
] Performance
] Heterogeneity
] System Scalability
] System Reliability
] Identification and Authentication
] System Expandability
] Security
] Cyber Attacks Detection and Control
(cyberguard)
] Data Replication, Consistency, and Caching
Manageability (administration and control)
14
Amazon.com: Online sales/trading
killer E-commerce Portal
] Several Thousands of Items
books, publishers, suppliers
] Millions of Customers
Customers details, transactions details, support
for transactions update
] (Millions) of Partners
Keep track of partners details, tracking referral
link to partner and sales and payment
] Sales based on advertised price
] Sales through auction/bids
A mechanism for participating in the bid
(buyers/sellers define rules of the game)
15
Can these drive
E-Commerce ?
] Clusters are already in use for web serving, web-hosting, and
a number of other Internet applications including E-commerce
scalability, availability, performance, reliable-high
performance-massive storage and database support.
Attempts to support online detection of cyber attacks (through
data mining) and control
] Hyperclusters and the GRID:
Support for transparency in (secure) Site/Data Replication for high
availability and quick response time (taking site close to the user).
Compute power from hyperclusters/Grid can be used for data
mining for cyber attacks and fraud detection and control.
Helps to build Compute Power Market, ASPs, and Computing
Portals.
16
Science Portals - e.g., PAPIA system
PAPIA PC Cluster
Pentiums
Myrinet
NetBSD/Linux
PM
Score-D
MPC++
RWCP Japan: http://www.rwcp.or.jp/papia/
17
PDC hot topics for E-commerce

] Cluster based web-servers, search engines, portals
] Scheduling and Single System Image.
] Heterogeneous Computing
] Reliability and High Availability and Data Recovery
] Parallel Databases and high performance-reliable-mass storage
systems.
] CyberGuard! Data mining for detection and online control of
cyber attacks, frauds, etc.
] Data Mining for identifying sales pattern and automatically tuning
portal to special sessions/festival sales
] eCash, eCheque, eBank, eSociety, eGovernment, eEntertainment,
eTravel, eGoods, and so on.
] Data/Site Replications and Caching techniques
] Compute Power Market
] Infowares (yahoo.com, AOL.com)
] ASPs (application service providers)
] . . .
18
Sequential Architecture
Limitations
Sequential architectures reaching physical
limitation (speed of light, thermodynamics)
Hardware improvements like pipelining,
superscalar execution, etc., are non-scalable and
require sophisticated compiler
technology.
Vector Processing works well for certain
kind of problems.
19
Computational Power Improvement
[Figure: C.P.I. versus No. of Processors (1, 2, ...), with one curve for Multiprocessor and one for Uniprocessor]
20
Human Physical Growth Analogy:
Computational Power Improvement
[Figure: Growth versus Age (5 to 45 ...), showing Vertical and Horizontal growth]
21
The Tech. of PP is mature and can be
exploited commercially; significant
R & D work on development of tools
& environment.
Significant development in
Networking technology is paving a
way for heterogeneous computing.
Why Parallel Processing
NOW?
22
History of Parallel Processing
] PP can be traced to a tablet dated
around 100 BC.
The tablet has 3 calculating positions.
Infer that the multiple positions were for:
Reliability / Speed
23
The aggregate speed with
which complex calculations are
carried out by millions of neurons in the
human brain is amazing, although an
individual neuron's response is slow
(milliseconds). This demonstrates the
feasibility of PP.
Motivating Factors
24
] Simple classification by Flynn:
(No. of instruction and data streams)
SISD - conventional
SIMD - data parallel, vector computing
MISD - systolic arrays
MIMD - very general, multiple approaches.
] Current focus is on MIMD model, using
general purpose processors or
multicomputers.

Taxonomy of Architectures
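Flynn's classification depends only on how many instruction streams and data streams a machine has. A minimal sketch of that rule follows; the function name `flynn_class` is hypothetical.

```python
def flynn_class(instruction_streams, data_streams):
    """Return the Flynn class (SISD, SIMD, MISD or MIMD) for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    i = "S" if instruction_streams == 1 else "M"   # single or multiple instruction
    d = "S" if data_streams == 1 else "M"          # single or multiple data
    return f"{i}I{d}D"

print(flynn_class(1, 1))    # conventional sequential machine
print(flynn_class(1, 64))   # data parallel / vector style
print(flynn_class(16, 16))  # cluster of independent processors
```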
25
Main HPC Architectures..1a
] SISD - mainframes, workstations, PCs.
] SIMD Shared Memory - Vector machines, Cray...
] MIMD Shared Memory - Sequent, KSR, Tera, SGI,
SUN.
] SIMD Distributed Memory - DAP, TMC CM-2...
] MIMD Distributed Memory - Cray T3D, Intel,
Transputers, TMC CM-5, plus recent workstation
clusters (IBM SP2, DEC, Sun, HP).
26
Motivation for using Clusters
]The communications bandwidth between
workstations is increasing as new
networking technologies and protocols are
implemented in LANs and WANs.
]Workstation clusters are easier to integrate
into existing networks than special parallel
computers.
27
Main HPC Architectures..1b.
] NOTE: Modern sequential machines are not purely
SISD - advanced RISC processors use many
concepts from
vector and parallel architectures (pipelining,
parallel execution of instructions, prefetching of
data, etc) in order to achieve one or more
arithmetic operations per clock cycle.
28
Parallel Processing Paradox
]Time required to develop a parallel
application for solving GCA is equal
to:

Half Life of Parallel Supercomputers.
29
The Need for Alternative
Supercomputing Resources
]Vast numbers of under utilised
workstations available to use.
]Huge numbers of unused processor
cycles and resources that could be
put to good use in a wide variety of
applications areas.
]Reluctance to buy Supercomputer
due to their cost and short life span.
]Distributed compute resources fit
better into today's funding model.
30
Technology Trend
31
Scalable Parallel
Computers
32
Design Space of Competing
Computer Architecture
33
Towards Inexpensive
Supercomputing
It is:

Cluster Computing..
The Commodity
Supercomputing!
34
Cluster Computing -
Research Projects
] Beowulf (CalTech and NASA) - USA
] CCS (Computing Centre Software) - Paderborn, Germany
] Condor - University of Wisconsin-Madison, USA
] DQS (Distributed Queuing System) - Florida State University, US.
] EASY - Argonne National Lab, USA
] HPVM -(High Performance Virtual Machine),UIUC&now UCSB,US
] far - University of Liverpool, UK
] Gardens - Queensland University of Technology, Australia
] MOSIX - Hebrew University of Jerusalem, Israel
] MPI (MPI Forum, MPICH is one of the popular implementations)
] NOW (Network of Workstations) - Berkeley, USA
] NIMROD - Monash University, Australia
] NetSolve - University of Tennessee, USA
] PBS (Portable Batch System) - NASA Ames and LLNL, USA
] PVM - Oak Ridge National Lab./UTK/Emory, USA
35
Cluster Computing -
Commercial Software
] Codine (Computing in Distributed Network Environment) -
GENIAS GmbH, Germany
] LoadLeveler - IBM Corp., USA
] LSF (Load Sharing Facility) - Platform Computing, Canada
] NQE (Network Queuing Environment) - Craysoft Corp., USA
] OpenFrame - Centre for Development of Advanced
Computing, India
] RWCP (Real World Computing Partnership), Japan
] UnixWare (SCO - Santa Cruz Operation), USA
] Solaris-MC (Sun Microsystems), USA
] ClusterTools (a number of free HPC cluster tools from Sun)
] A number of commercial vendors worldwide are offering
clustering solutions, including IBM, Compaq, Microsoft, and a
number of startups like TurboLinux, HPTI, Scali,
BlackStone...
36
]Surveys show utilisation of CPU cycles of
desktop workstations is typically <10%.
]Performance of workstations and PCs is
rapidly improving
]As performance grows, percent utilisation
will decrease even further!
]Organisations are reluctant to buy large
supercomputers, due to the large expense
and short useful life span.
37
]The development tools for workstations
are more mature than the contrasting
proprietary solutions for parallel
computers - mainly due to the non-
standard nature of many parallel systems.
]Workstation clusters are a cheap and
readily available alternative to
specialised High Performance Computing
(HPC) platforms.
]Use of clusters of workstations as a
distributed compute resource is very cost
effective - incremental growth of system!!!
38
Cycle Stealing
]Usually a workstation will be owned by an
individual, group, department, or
organisation - it is dedicated to the
exclusive use of its owners.
]This brings problems when attempting to
form a cluster of workstations for running
distributed applications.
39
Cycle Stealing
]Typically, there are three types of owners,
who use their workstations mostly for:
1. Sending and receiving email and preparing
documents.
2. Software development - edit, compile, debug and
test cycle.
3. Running compute-intensive applications.
40
Cycle Stealing
]Cluster computing aims to steal spare cycles
from (1) and (2) to provide resources for (3).
]However, this requires overcoming the
ownership hurdle - people are very protective
of their workstations.
]Usually requires organisational mandate that
computers are to be used in this way.
]Stealing cycles outside standard work hours
(e.g. overnight) is easy, stealing idle cycles
during work hours without impacting
interactive use (both CPU and memory) is
much harder.
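The last point (stealing idle cycles without hurting interactive use) comes down to an idleness test on both CPU and memory. A minimal sketch follows; the class, the function and all threshold values are assumptions for illustration, not taken from any particular cluster system.

```python
from dataclasses import dataclass

@dataclass
class WorkstationState:
    cpu_load: float             # fraction of CPU in use, 0.0 to 1.0
    free_memory_mb: int         # memory not needed by the owner's programs
    seconds_since_input: float  # time since last keyboard or mouse activity

def may_steal_cycles(state, max_load=0.10, min_free_mb=256, min_quiet_s=300):
    """True only if a guest job should not disturb the owner (hypothetical thresholds)."""
    return (state.cpu_load <= max_load
            and state.free_memory_mb >= min_free_mb
            and state.seconds_since_input >= min_quiet_s)

overnight = WorkstationState(cpu_load=0.02, free_memory_mb=900, seconds_since_input=7200)
editing = WorkstationState(cpu_load=0.05, free_memory_mb=900, seconds_since_input=4)
print(may_steal_cycles(overnight), may_steal_cycles(editing))
```

A guest job would also need to be suspended or migrated as soon as the test turns false.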
41
Rise & Fall of Computing
Technologies
Mainframes Minis PCs
Minis PCs Network
Computing
1970 1980 1995
42
Original Food Chain Picture
43
1984 Computer Food Chain
Mainframe
Vector Supercomputer
Mini Computer
Workstation
PC
44
Mainframe
Vector Supercomputer
MPP
Workstation
PC
1994 Computer Food Chain
Mini Computer
(hitting wall soon)
(future is bleak)
45
Computer Food Chain (Now and Future)
46
What is a cluster?
]A cluster is a type of parallel or distributed
processing system, which consists of a
collection of interconnected stand-
alone/complete computers cooperatively
working together as a single, integrated
computing resource.
]A typical cluster:
Network: Faster, closer connection than a typical
network (LAN)
Low latency communication protocols
Looser connection than SMP
47
Why Clusters now?
(Beyond Technology and Cost)
] Building block is big enough
complete computers (HW & SW) shipped in
millions: killer micro, killer RAM, killer disks,
killer OS, killer networks, killer apps.
] Workstations performance is doubling every 18
months.
] Networks are faster
] Higher link bandwidth (vs. 10 Mbit Ethernet)
]Switch based networks coming (ATM)
]Interfaces simple & fast (Active Msgs)
] Striped files preferred (RAID)
] Demise of Mainframes, Supercomputers, & MPPs
48
Architectural Drivers(cont)
] Node architecture dominates performance
processor, cache, bus, and memory
design and engineering $ => performance
] Greatest demand for performance is on large systems
must track the leading edge of technology without lag
] MPP network technology => mainstream
system area networks
] System on every node is a powerful enabler
very high speed I/O, virtual memory, scheduling,

49
...Architectural Drivers
] Clusters can be grown: Incremental scalability (up,
down, and across)
Individual nodes performance can be improved by
adding additional resource (new memory blocks/disks)
New nodes can be added or nodes can be removed
Clusters of Clusters and Metacomputing
] Complete software tools
Threads, PVM, MPI, DSM, C, C++, Java, Parallel
C++, Compilers, Debuggers, OS, etc.
] Wide class of applications
Sequential and grand challenging parallel applications

Clustering of Computers
for Collective Computing: Trends
1960
1990 1995+
2000
?
51
Example Clusters:
Berkeley NOW
] 100 Sun
UltraSparcs
200 disks
] Myrinet SAN
160 MB/s
] Fast comm.
AM, MPI, ...
] Ether/ATM
switched
external net
] Global OS
] Self Config
52
Basic Components
$
P
M
I/O bus
MyriNet
P
Sun Ultra 170
Myricom
NIC
160 MB/s
M
53
Massive Cheap Storage
Cluster
] Basic unit:
2 PCs double-ending
four SCSI chains of 8
disks each
Currently serving Fine Art at http://www.thinker.org/imagebase/

54
Cluster of SMPs (CLUMPS)
] Four Sun E5000s
8 processors
4 Myricom NICs each
] Multiprocessor, Multi-
NIC, Multi-Protocol

] NPACI => Sun 450s
55
Millennium PC Clumps
] Inexpensive, easy
to manage Cluster
] Replicated in many
departments
] Prototype for very
large PC cluster
56
Adoption of the Approach
57
So What's So Different?
] Commodity parts?
] Communications Packaging?
] Incremental Scalability?
] Independent Failure?
] Intelligent Network Interfaces?
] Complete System on every node
virtual memory
scheduler
files
...
58
OPPORTUNITIES
&
CHALLENGES
59
Shared Pool of
Computing Resources:
Processors, Memory, Disks
Interconnect
Guarantee at least one
workstation to many individuals
(when active)
Deliver large % of collective
resources to few individuals
at any one time
Opportunity of Large-scale
Computing on NOW
60
Windows of Opportunities
] MPP/DSM:
Compute across multiple systems: parallel.
] Network RAM:
Idle memory in other nodes. Page across
other nodes' idle memory
] Software RAID:
file system supporting parallel I/O and
reliability, mass-storage.
] Multi-path Communication:
Communicate across multiple networks:
Ethernet, ATM, Myrinet

61
Parallel Processing
] Scalable Parallel Applications require
good floating-point performance
low overhead communication
scalable network bandwidth
parallel file system
62
Network RAM
] Performance gap between processor and
disk has widened.

] Thrashing to disk degrades performance
significantly

] Paging across networks can be effective
with high performance networks and OS
that recognizes idle machines

] Typically thrashing to network RAM can be 5
to 10 times faster than thrashing to disk
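A back-of-envelope calculation shows where a ratio of this size can come from. Every number below (page size, latencies, bandwidths) is an assumption chosen for illustration, not a measurement.

```python
def fetch_time_ms(latency_ms, page_bytes, bandwidth_mb_per_s):
    """Time to bring in one page: fixed latency plus transfer time."""
    transfer_ms = page_bytes / (bandwidth_mb_per_s * 1e6) * 1e3
    return latency_ms + transfer_ms

PAGE_BYTES = 8192  # assumed page size

# Assumed disk: 8 ms seek plus rotation, 10 MB/s transfer.
disk_ms = fetch_time_ms(8.0, PAGE_BYTES, 10.0)
# Assumed fast network to a remote node's idle memory: 1 ms latency, 100 MB/s.
net_ms = fetch_time_ms(1.0, PAGE_BYTES, 100.0)

ratio = disk_ms / net_ms
print(f"disk {disk_ms:.2f} ms, network RAM {net_ms:.2f} ms, ratio {ratio:.1f}x")
```

The fixed latency dominates both cases, which is why low-latency networks and protocols matter more here than raw bandwidth.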
63
Software RAID: Redundant
Array of Workstation Disks
] I/O Bottleneck:
Microprocessor performance is improving more
than 50% per year.
Disk access improvement is < 10%
Applications often perform I/O
] RAID cost per byte is high compared to single
disks
] RAIDs are connected to host computers which are
often a performance and availability bottleneck
] RAID in software: writing data across an array of
workstation disks provides performance, and some
degree of redundancy provides availability.
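The idea can be sketched with in-memory byte blocks standing in for workstation disks: data is striped across the disks, and one XOR parity block lets any single lost block be rebuilt. This is a simplified illustration under those assumptions, not the design of any specific software RAID; all names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def stripe(data, n_disks, block_size=4):
    """Split data into one block per data disk and append an XOR parity block."""
    if len(data) > n_disks * block_size:
        raise ValueError("data does not fit in one stripe")
    padded = data.ljust(n_disks * block_size, b"\0")
    blocks = [padded[i * block_size:(i + 1) * block_size] for i in range(n_disks)]
    return blocks + [xor_blocks(blocks)]

def recover(stripe_blocks, lost_index):
    """Rebuild the block at lost_index from all surviving blocks."""
    survivors = [b for i, b in enumerate(stripe_blocks) if i != lost_index]
    return xor_blocks(survivors)

blocks = stripe(b"cluster data", n_disks=3)
print(blocks[1], recover(blocks, lost_index=1))
```

Reads and writes of different blocks can proceed on different disks in parallel, which is where the performance gain comes from.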
64
Software RAID, Parallel File
Systems, and Parallel I/O
65

Cluster Computer and its
Components
66
Clustering Today
]Clustering gained momentum when 3
technologies converged:
1. Very HP Microprocessors
workstation performance = yesterday's supercomputers
2. High speed communication
Comm. between cluster nodes >= between processors
in an SMP.
3. Standard tools for parallel/ distributed
computing & their growing popularity.
67
Cluster Computer
Architecture
68
Cluster Components...1a
Nodes
]Multiple High Performance Components:
PCs
Workstations
SMPs (CLUMPS)
Distributed HPC Systems leading to
Metacomputing
]They can be based on different
architectures and run different OSes
69
Cluster Components...1b
Processors
] There are many (CISC/RISC/VLIW/Vector..)
Intel: Pentiums, Xeon, Merced.
Sun: SPARC, ULTRASPARC
HP PA
IBM RS6000/PowerPC
SGI MIPS
Digital Alphas
] Integrate Memory, processing and
networking into a single chip
IRAM (CPU & Mem): (http://iram.cs.berkeley.edu)
Alpha 21364 (CPU, Memory Controller, NI)
70
Cluster Components2
OS
]State of the art OS:
Linux (Beowulf)
Microsoft NT (Illinois HPVM)
SUN Solaris (Berkeley NOW)
IBM AIX (IBM SP2)
HP UX (Illinois - PANDA)
Mach (Microkernel based OS) (CMU)
Cluster Operating Systems (Solaris MC, SCO Unixware,
MOSIX (academic project))
OS gluing layers: (Berkeley Glunix)

71
Cluster Components3
High Performance Networks
]Ethernet (10Mbps),
]Fast Ethernet (100Mbps),
]Gigabit Ethernet (1Gbps)
]SCI (Dolphin: 12 microsecond MPI
latency)
]ATM
]Myrinet (1.2Gbps)
]Digital Memory Channel
]FDDI
72
Cluster Components4
Network Interfaces
]Network Interface Card
Myrinet has NIC
User-level access support
Alpha 21364 processor integrates
processing, memory controller,
network interface into a single chip..
73
Cluster Components5
Communication Software
] Traditional OS supported facilities (heavy
weight due to protocol processing)..
Sockets (TCP/IP), Pipes, etc.
] Light weight protocols (User Level)
Active Messages (Berkeley)
Fast Messages (Illinois)
U-net (Cornell)
XTP (Virginia)
] Systems can be built on top of the
above protocols
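For reference, the traditional OS-supported route looks like this: every send and receive below is a system call into the kernel, which is the overhead the user-level protocols try to avoid. This is a minimal single-process sketch using a connected socket pair; the function name is hypothetical.

```python
import socket

def echo_once(message: bytes) -> bytes:
    """Send a small message over a kernel socket pair and get an upper-cased reply."""
    a, b = socket.socketpair()        # two connected endpoints provided by the OS
    try:
        a.sendall(message)            # system call: copy into kernel buffers
        received = b.recv(1024)       # system call: copy out of kernel buffers
        b.sendall(received.upper())
        return a.recv(1024)
    finally:
        a.close()
        b.close()

print(echo_once(b"ping"))
```

The sketch only works for messages small enough to fit in the socket buffers, since one thread plays both ends.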
74
Cluster Components6a
Cluster Middleware
]Resides between the OS and applications
and offers an infrastructure for supporting:
Single System Image (SSI)
System Availability (SA)
]SSI makes collection appear as single
machine (globalised view of system
resources). Telnet cluster.myinstitute.edu
]SA - Check pointing and process
migration..
75
Cluster Components6b
Middleware Components
]Hardware
DEC Memory Channel, DSM (Alewife, DASH) SMP
Techniques
]OS / Gluing Layers
Solaris MC, Unixware, Glunix
]Applications and Subsystems
System management and electronic forms
Runtime systems (software DSM, PFS etc.)
Resource management and scheduling (RMS):
CODINE, LSF, PBS, NQS, etc.

76
Cluster Components7a
Programming environments
] Threads (PCs, SMPs, NOW..)
POSIX Threads
Java Threads
] MPI
Linux, NT, on many Supercomputers
] PVM
] Software DSMs (Shmem)
77
Cluster Components7b
Development Tools ?
]Compilers
C/C++/Java/ ;
Parallel programming with C++ (MIT Press book)
]RAD (rapid application development
tools).. GUI based tools for PP
modeling
]Debuggers
]Performance Analysis Tools
]Visualization Tools
78
Cluster Components8
Applications
]Sequential
]Parallel / Distributed (Cluster-aware
app.)
Grand Challenging applications
Weather Forecasting
Quantum Chemistry
Molecular Biology Modeling
Engineering Analysis (CAD/CAM)
.
PDBs, web servers,data-mining
79
Key Operational Benefits of Clustering
] System availability (HA). Clusters offer inherent high system
availability due to the redundancy of hardware,
operating systems, and applications.
] Hardware fault tolerance. Redundancy for most system
components (e.g. disk RAID), including both hardware
and software.
] OS and application reliability. Run multiple copies of the
OS and applications, and gain reliability through this redundancy.
] Scalability. Add servers to the cluster, add
more clusters to the network as the need arises, or add CPUs to
an SMP.
] High performance. (Running cluster-enabled programs.)
80

Classification
of Cluster Computer
81
Clusters Classification..1
]Based on Focus (in Market)
High Performance (HP) Clusters
Grand Challenging Applications
High Availability (HA) Clusters
Mission Critical applications
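HA clusters commonly detect a failed server through a heartbeat: each node reports in periodically, and a node that stays silent for too long is treated as failed so that another can take over. A minimal sketch of the detection side follows; the class name and the timeout are assumptions, and time is passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that have gone silent."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def beat(self, node, now):
        """Record a heartbeat from node at time now (seconds)."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)

monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.beat("server-a", now=0.0)
monitor.beat("server-b", now=0.0)
monitor.beat("server-a", now=2.0)    # server-b stops reporting
print(monitor.failed_nodes(now=4.0))
```

A real system would follow detection with failover of the failed node's services.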
82
HA Cluster: Server Cluster with
"Heartbeat" Connection
83
]Based on Workstation/PC Ownership
Dedicated Clusters
Non-dedicated clusters
Adaptive parallel computing
Also called Communal multiprocessing
84
]Based on Node Architecture..
Clusters of PCs (CoPs)
Clusters of Workstations (COWs)
Clusters of SMPs (CLUMPs)
85
Building Scalable Systems:
Cluster of SMPs (Clumps)
Performance of SMP Systems Vs.
Four-Processor Servers in a Cluster
86
]Based on Node OS Type..
Linux Clusters (Beowulf)
Solaris Clusters (Berkeley NOW)
NT Clusters (HPVM)
AIX Clusters (IBM SP2)
SCO/Compaq Clusters (Unixware)
.Digital VMS Clusters, HP
clusters, ..

87
]Based on node components
architecture & configuration
(Processor Arch, Node Type:
PC/Workstation.. & OS: Linux/NT..):
Homogeneous Clusters
All nodes will have similar configuration
Heterogeneous Clusters
Nodes based on different processors and
running different OSes.
88
Clusters Classification..6a
Dimensions of Scalability & Levels of
Clustering
Network
Technology
Platform
Uniprocessor
SMP
Cluster
MPP
(1)
(2)
(3)
Campus
Enterprise
Workgroup
Department
Public
Metacomputing (GRID)
89
Clusters Classification..6b
Levels of Clustering
]Group Clusters (#nodes: 2-99)
(a set of dedicated/non-dedicated computers -
mainly connected by SAN like Myrinet)
] Departmental Clusters (#nodes: 99-999)
] Organizational Clusters (#nodes: many 100s)
(using ATM networks)
] Internet-wide Clusters=Global Clusters:
(#nodes: 1000s to many millions)
Metacomputing
Web-based Computing
Agent Based Computing
Java plays a major role in web and agent based computing
90
Size Scalability (physical & application)
Enhanced Availability (failure management)
Single System Image (look-and-feel of one system)
Fast Communication (networks & protocols)
Load Balancing (CPU, Net, Memory, Disk)
Security and Encryption (clusters of clusters)
Distributed Environment (Social issues)
Manageability (admin. and control)
Programmability (simple API if required)
Applicability (cluster-aware and non-aware app.)
Major issues in cluster
design
91

Cluster Middleware
and
Single System Image
92
A typical Cluster Computing
Environment

PVM / MPI/ RSH
Application
Hardware/OS
???
93
CC should support
] Multi-user, time-sharing environments
] Nodes with different CPU speeds and
memory sizes (heterogeneous configuration)
] Many processes, with unpredictable
requirements
] Unlike SMP: insufficient bonds between
nodes
Each computer operates independently
Inefficient utilization of resources
94
The missing link is provided by
cluster middleware/underware

PVM / MPI/ RSH
Application
Hardware/OS
Middleware or
Underware
95
SSI Clusters--SMP services on a CC
] Adaptive resource usage for better
performance
] Ease of use - almost like SMP
] Scalable configurations - by decentralized
control

Result: HPC/HAC at PC/Workstation prices
Pool Together the Cluster-Wide resources
96
What is Cluster Middleware ?
] An interface between user
applications and the cluster hardware and OS
platform.
] Middleware packages support each other at
the management, programming, and
implementation levels.
] Middleware Layers:
SSI Layer
Availability Layer: It enables the cluster services of
checkpointing, automatic failover, recovery from
failure, and
fault-tolerant operation among all cluster nodes.
97
Middleware Design Goals
] Complete Transparency (Manageability)
Lets the user see a single cluster system.
Single entry point, ftp, telnet, software loading...
] Scalable Performance
Easy growth of cluster
no change of API & automatic load distribution.
] Enhanced Availability
Automatic Recovery from failures
Employ checkpointing & fault tolerant technologies
Handle consistency of data when replicated..
98
What is Single System Image
(SSI) ?
]A single system image is the
illusion, created by software or
hardware, that presents a
collection of resources as one,
more powerful resource.
]SSI makes the cluster appear like a
single machine to the user, to
applications, and to the network.
]A cluster without a SSI is not a
cluster
99
Benefits of Single System
Image
] Usage of system resources transparently
] Transparent process migration and load
balancing across nodes.
] Improved reliability and higher availability
] Improved system response time and
performance
] Simplified system management
] Reduction in the risk of operator errors
] User need not be aware of the underlying
system architecture to use these machines
effectively
100
Desired SSI Services
] Single Entry Point
telnet cluster.my_institute.edu
telnet node1.cluster.institute.edu
] Single File Hierarchy: xFS, AFS, Solaris MC Proxy
] Single Control Point: Management from single GUI
] Single virtual networking
] Single memory space - Network RAM / DSM
] Single Job Management: Glunix, Codine, LSF
] Single User Interface: Like workstation/PC
windowing environment (CDE in Solaris/NT); it
may use Web technology
101
Availability Support
Functions
] Single I/O Space (SIO):
any node can access any peripheral or disk devices
without the knowledge of physical location.
] Single Process Space (SPS)
Any process on any node can create processes in a cluster-wide
process space, and they communicate through
signals, pipes, etc., as if they were on a single node.
] Checkpointing and Process Migration.
Saves the process state and intermediate results in
memory to disk to support rollback recovery when
node fails. PM for Load balancing...

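Checkpointing with rollback recovery can be sketched as follows: a long computation saves its state to disk at intervals, and after a failure it restarts from the last saved state instead of from the beginning. This is a single-machine illustration; the function names, the checkpoint interval and the simulated failure are all assumptions.

```python
import os
import pickle

def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it so a crash never leaves a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

def sum_squares(n, path, every=100, fail_at=None):
    """Sum i*i for i < n, checkpointing every `every` steps; fail_at simulates a node failure."""
    state = load_checkpoint(path, {"i": 0, "total": 0})
    i, total = state["i"], state["total"]
    while i < n:
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        total += i * i
        i += 1
        if i % every == 0:
            save_checkpoint(path, {"i": i, "total": total})
    return total
```

Calling `sum_squares` again with the same path after a failure resumes from the last checkpoint, so only the work done since that checkpoint is repeated.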
102
Scalability Vs. Single System
Image

UP
103
SSI Levels/How do we
implement SSI ?
] It is a computer science notion of levels of
abstractions (house is at a higher level of
abstraction than walls, ceilings, and floors).

Application and Subsystem Level
Operating System Kernel Level
Hardware Level
104
SSI at Application and
Subsystem Level
Level | Examples | Boundary | Importance
application | cluster batch system, system management | an application | what a user wants
subsystem | distributed DB, OSF DME, Lotus Notes, MPI, PVM | a subsystem | SSI for all applications of the subsystem
file system | Sun NFS, OSF DFS, NetWare, and so on | shared portion of the file system | implicitly supports many applications and subsystems
toolkit | OSF DCE, Sun ONC+, Apollo Domain | explicit toolkit facilities: user, service name, time | best level of support for heterogeneous system
(c) In search of clusters
105
SSI at Operating System
Kernel Level
Level | Examples | Boundary | Importance
Kernel/OS layer | Solaris MC, Unixware, MOSIX, Sprite, Amoeba / GLunix | each name space: files, processes, pipes, devices, etc. | kernel support for applications, adm subsystems
kernel interfaces | UNIX (Sun) vnode, Locus (IBM) vproc | type of kernel objects: files, processes, etc. | modularizes SSI code within kernel
virtual memory | none supporting operating system kernel | each distributed virtual memory space | may simplify implementation of kernel objects
microkernel | Mach, PARAS, Chorus, OSF/1AD, Amoeba | each service outside the microkernel | implicit SSI for all system services
106
SSI at Hardware Level
Level | Examples | Boundary | Importance
memory | SCI, DASH | memory space | better communication and synchronization
memory and I/O | SCI, SMP techniques | memory and I/O device space | lower overhead cluster I/O
Application and Subsystem Level
Operating System Kernel Level
107
SSI Characteristics
]1. Every SSI has a boundary
]2. Single system support can exist
at different levels within a system,
one able to be built on another
108
SSI Boundaries -- an
applications SSI boundary
Batch System
SSI
Boundary
(c) In search
of clusters
109
Relationship Among
Middleware Modules
110
SSI via OS path!
] 1. Build as a layer on top of the existing OS
Benefits: makes the system quickly portable, tracks
vendor software upgrades, and reduces development
time.
i.e. new systems can be built quickly by mapping
new services onto the functionality provided by the
layer beneath. Eg: Glunix
] 2. Build SSI at kernel level, True Cluster OS
Good, but can't leverage OS improvements by the
vendor
E.g. Unixware, Solaris-MC, and MOSIX
111
SSI Representative Systems
]OS level SSI
SCO NSC UnixWare
Solaris-MC
MOSIX, .
]Middleware level SSI
PVM, TreadMarks (DSM), Glunix,
Condor, Codine, Nimrod, .
]Application level SSI
PARMON, Parallel Oracle, ...
112
SCO NonStop
Cluster for UnixWare

[Figure: two UP or SMP nodes, each running users, applications, and systems management over standard OS kernel calls and extensions, on standard SCO UnixWare with clustering hooks plus modular kernel extensions, with attached devices; the nodes connect to each other and to other nodes through ServerNet]
http://www.sco.com/products/clustering/
113
How does NonStop Clusters
Work?
] Modular Extensions and Hooks to Provide:
Single Clusterwide Filesystem view
Transparent Clusterwide device access
Transparent swap space sharing
Transparent Clusterwide IPC
High Performance Internode Communications
Transparent Clusterwide Processes, migration,etc.
Node down cleanup and resource failover
Transparent Clusterwide parallel TCP/IP networking
Application Availability
Clusterwide Membership and Cluster timesync
Cluster System Administration
Load Leveling
114
Solaris-MC: Solaris for
MultiComputers
] global file
system
] globalized
process
management
] globalized
networking
and I/O
Solaris MC Architecture
[Figure: applications enter through the system call interface; inside the kernel, Solaris MC (network, file system, and process components on a C++ object framework) sits on the existing Solaris 2.5 kernel and reaches other nodes through object invocations]
http://www.sun.com/research/solaris-mc/
115
Solaris MC components
] Object and
communication
support
] High availability
support
] PXFS global
distributed file
system
] Process
management
] Networking
116
Multicomputer OS for UNIX
(MOSIX)
] An OS module (layer) that provides the
applications with the illusion of working on a single
system
] Remote operations are performed like local
operations
] Transparent to the application - user interface
unchanged
PVM / MPI / RSH
Application
Hardware/OS
http://www.mosix.cs.huji.ac.il/
117
Main tool
] Supervised by distributed algorithms that
respond on-line to global resource
availability - transparently
] Load-balancing - migrate process from over-
loaded to under-loaded nodes
] Memory ushering - migrate processes from a
node that has exhausted its memory, to prevent
paging/swapping
Preemptive process migration that can
migrate--->any process, anywhere, anytime
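The load-balancing rule (migrate processes from over-loaded to under-loaded nodes) can be sketched with a simple greedy loop. Note the assumptions: this is a centralised simplification with a global view of all loads, whereas the slides describe distributed algorithms; the function name and threshold are hypothetical.

```python
def balance(loads, threshold=1):
    """Move one process at a time from the busiest to the idlest node
    until their load difference is at most threshold (must be >= 1)."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1 or the loop may never settle")
    loads = dict(loads)          # do not modify the caller's mapping
    migrations = []
    while True:
        busiest = max(loads, key=loads.get)
        idlest = min(loads, key=loads.get)
        if loads[busiest] - loads[idlest] <= threshold:
            return loads, migrations
        loads[busiest] -= 1      # preemptively migrate one process
        loads[idlest] += 1
        migrations.append((busiest, idlest))

final, moves = balance({"node1": 8, "node2": 1, "node3": 0})
print(final, len(moves))
```

Memory ushering follows the same pattern with free memory, rather than process count, as the quantity being balanced.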
118
MOSIX for Linux at HUJI
] A scalable cluster configuration:
50 Pentium-II 300 MHz
38 Pentium-Pro 200 MHz (some are SMPs)
16 Pentium-II 400 MHz (some are SMPs)
] Over 12 GB cluster-wide RAM
] Connected by the Myrinet 2.56 Gb/s LAN
Runs Red-Hat 6.0, based on Kernel 2.2.7
] Upgrade: HW with Intel, SW with Linux
] Download MOSIX:
http://www.mosix.cs.huji.ac.il/
119
NOW @ Berkeley
] Design & Implementation of higher-level system
]Global OS (Glunix)
]Parallel File Systems (xFS)
]Fast Communication (HW for Active Messages)
]Application Support
] Overcoming technology shortcomings
]Fault tolerance
]System Management
] NOW Goal: Faster for Parallel AND Sequential
http://now.cs.berkeley.edu/
120
NOW Software Components
[Figure: layered diagram, top to bottom]
Large Seq. Apps, Parallel Apps
Sockets, Split-C, MPI, HPF, vSM
Global Layer Unix, Active Messages, Name Svr
On each workstation (four shown): Unix (Solaris) Workstation, VN segment Driver, AM L.C.P.
Myrinet Scalable Interconnect
121
3 Paths for Applications on
NOW?
] Revolutionary (MPP style): write new programs from
scratch using MPP languages, compilers, and libraries
] Porting: port programs from mainframes,
supercomputers, and MPPs
] Evolutionary: take a sequential program and use
1) Network RAM: first use the memory of many
computers to reduce disk accesses; if that is not
fast enough, then:
2) Parallel I/O: use many disks in parallel for
accesses that miss the file cache; if that is not
fast enough, then:
3) Parallel program: change the program until it uses
enough processors to be fast
=> Large speedup without writing a fine-grain parallel program
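The reasoning behind step 1 of the evolutionary path can be made concrete with an expected access time across three tiers: local file cache, remote memory (Network RAM), and disk. The sketch below is only an illustration; the function name, the hit fractions, and the per-tier latencies are hypothetical parameters, not measurements from the NOW project.

```c
#include <assert.h>

/*
 * Expected time per block access, in microseconds, when a fraction p_cache
 * of accesses hit the local file cache, a fraction p_netram hit another
 * node's memory, and the remainder go to disk.
 * All probabilities and latencies are caller-supplied assumptions.
 */
double expected_access_us(double p_cache, double p_netram,
                          double t_cache_us, double t_netram_us,
                          double t_disk_us)
{
    assert(p_cache >= 0.0 && p_netram >= 0.0 && p_cache + p_netram <= 1.0);
    double p_disk = 1.0 - p_cache - p_netram;  /* everything else hits disk */
    return p_cache * t_cache_us + p_netram * t_netram_us + p_disk * t_disk_us;
}
```

Calling the function once with p_netram set to 0 and once with a non-zero value shows how much of the disk term Network RAM removes, provided remote memory is much faster than disk on the interconnect in question.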
122
Comparison of 4 Cluster Systems
123
Cluster Programming
Environments
] Shared Memory Based
DSM
Threads/OpenMP (enabled for clusters)
Java threads (HKU JESSICA, IBM cJVM)
] Message Passing Based
PVM (Parallel Virtual Machine)
MPI (Message Passing Interface)
] Parametric Computations
Nimrod/Clustor
] Automatic Parallelising Compilers
] Parallel Libraries & Computational Kernels (NetSolve)
124
Levels of Parallelism
Code-Granularity                    Code Item            Exploited by
Large grain (task level)            Program              PVM/MPI
Medium grain (control level)        Function (thread)    Threads
Fine grain (data level)             Loop (Compiler)      Compilers
Very fine grain (multiple issue)    With hardware        CPU
[Figure: Task i-1, Task i, Task i+1 at the task level; func1(), func2(), func3() at the control level; a(0)=.., b(0)=.., a(1)=.., b(1)=.., a(2)=.., b(2)=.. at the data level; +, x, Load at the multiple-issue level]
125
MPI (Message Passing
Interface)
] A standard message passing interface.
MPI 1.0 - May 1994 (started in 1992)
C and Fortran bindings (now Java)
] Portable (once coded, it can run on virtually all HPC
platforms, including clusters!)
] Performance (by exploiting native hardware features)
] Functionality (over 115 functions in MPI 1.0)
environment management, point-to-point &
collective communications, process group,
communication world, derived data types, and virtual
topology routines.
] Availability - a variety of implementations available,
both vendor and public domain.
http://www.mpi-forum.org/
126
A Sample MPI Program...
#include <stdio.h>
#include <string.h>
#include <mpi.h>
int main( int argc, char *argv[ ])
{
    int my_rank; /* process rank */
    int p; /* no. of processes */
    int source; /* rank of sender */
    int dest; /* rank of receiver */
    int tag = 0; /* message tag, like email subject */
    char message[100]; /* buffer */
    MPI_Status status; /* function return status */
    /* Start up MPI */
    MPI_Init( &argc, &argv );
    /* Find our process rank/id */
    MPI_Comm_rank( MPI_COMM_WORLD, &my_rank);
    /* Find out how many processes/tasks are part of this run */
    MPI_Comm_size( MPI_COMM_WORLD, &p);

[Figure: each worker sends a "Hello,..." message to the master]
127
A Sample MPI Program
    if( my_rank == 0) /* Master Process */
    {
        /* Receive one greeting from each worker, in rank order */
        for( source = 1; source < p; source++)
        {
            MPI_Recv( message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }
    else /* Worker Process */
    {
        sprintf( message, "Hello, I am your worker process %d!", my_rank );
        dest = 0;
        /* +1 so the terminating '\0' is sent too */
        MPI_Send( message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }
    /* Shut down MPI environment */
    MPI_Finalize();
    return 0;
}
128
Execution
% cc -o hello hello.c -lmpi
% mpirun -p2 hello
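Hello, I am your worker process 1!
% mpirun -p4 hello
Hello, I am your worker process 1!
Hello, I am your worker process 2!
Hello, I am your worker process 3!
% mpirun hello
(no output: with a single process there are no workers, so no greetings)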
129
PARMON: A Cluster
Monitoring Tool
PARMON
High-Speed
Switch
parmond
parmon
PARMON Server
on each node
PARMON Client on JVM
http://www.buyya.com/parmon/
130
Resource Utilization at a
Glance
131
Single I/O Space and
Design Issues
Globalised Cluster Storage
Reference:
Designing SSI Clusters with Hierarchical Checkpointing and Single I/O
Space, IEEE Concurrency, March 1999
by K. Hwang, H. Jin, et al.
132

Without Single I/O Space
Users

With Single I/O Space Services
Users
Single I/O Space Services
Clusters with & without Single
I/O Space
133
Benefits of Single I/O Space
] Eliminate the gap between accessing local disk(s) and remote
disks
] Support persistent programming paradigm
] Allow striping on remote disks, accelerate parallel I/O
operations
] Facilitate the implementation of distributed checkpointing and
recovery schemes
134
Single I/O Space Design Issues
] Integrated I/O Space
] Addressing and Mapping Mechanisms
] Data movement procedures
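One way to picture the addressing and mapping issue is a function that turns a sequential, cluster-wide block number into a disk and a block on that disk. The sketch below assumes simple round-robin striping over equally sized disks; that policy, the disk_addr type, and both function names are assumptions made here for illustration and are not taken from the referenced paper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical physical address: which disk, and which block on that disk. */
typedef struct {
    size_t disk;
    size_t block;
} disk_addr;

/* Map a sequential global block number onto n_disks with round-robin
 * striping: consecutive global blocks land on consecutive disks. */
disk_addr map_block(size_t global_block, size_t n_disks)
{
    assert(n_disks > 0);
    disk_addr a;
    a.disk = global_block % n_disks;   /* stripe across the disks */
    a.block = global_block / n_disks;  /* row within the stripe set */
    return a;
}

/* Inverse mapping: recover the global block number from a disk address. */
size_t to_global(disk_addr a, size_t n_disks)
{
    assert(n_disks > 0 && a.disk < n_disks);
    return a.block * n_disks + a.disk;
}
```

Because neighbouring global blocks sit on different disks under this policy, a large sequential read can be served by several disks at once, which is the parallel I/O benefit listed among the advantages of a single I/O space.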
135
Integrated I/O Space
[Figure: a single space of sequential addresses spans three kinds of devices]
Local Disks (RADD Space): LD1 ... LDn, holding blocks D11 D12 ... D1t, D21 D22 ... D2t, ..., Dn1 Dn2 ... Dnt
Shared RAIDs (NASD Space): SD1 ... SDm, holding blocks B11 B12 ... B1k, B21 B22 ... B2k, ..., Bm1 Bm2 ... Bmk
Peripherals (NAP Space): P1 ... Ph
136
Addressing and Mapping
[Figure: User Applications run on top of user-level middleware plus some modified OS system calls. The middleware contains a Name Agent, a Disk/RAID/NAP Mapper, a Block Mover, and I/O Agents for the RADD, NASD, and NAP spaces.]
137
Data Movement Procedures
[Figure: two panels showing how data block A moves between Node 1 (User Application, I/O Agent, LD1) and Node 2 (I/O Agent, LD2 or SDi of the NASD) via the Block Mover. The second panel labels the exchange as Request and Data Block A.]
138
What Next ??
Clusters of Clusters (HyperClusters)
Global Grid
Interplanetary Grid
Universal Grid??
139
Clusters of Clusters (HyperClusters)
[Figure: Cluster 1, Cluster 2, and Cluster 3 connected over a LAN/WAN. Each cluster has its own Scheduler, Master Daemon, Execution Daemon, and Clients, with Submit and Graphical Control interfaces.]
140
Towards Grid Computing.
For illustration, placed resources arbitrarily on the GUSTO test-bed!!
141
What is Grid ?
] An infrastructure that couples
Computers (PCs, workstations, clusters, traditional
supercomputers, and even laptops, notebooks, mobile
computers, PDA, and so on)
Software (e.g., renting expensive special-purpose applications
on demand)
Databases (e.g., transparent access to the human genome database)
Special Instruments (e.g., radio telescopes: SETI@Home
searching for life in the galaxy, Astrophysics@Swinburne for
pulsars)
People (maybe even animals, who knows?)
] across local/wide-area networks (enterprise,
organisations, or the Internet) and presents them as
a unified, integrated (single) resource.
142
Conceptual view of the Grid
Leading to Portal (Super)Computing
http://www.sun.com/hpc/
143
Grid Application-Drivers
] Old and New applications getting enabled due
to coupling of computers, databases,
instruments, people, etc:
(distributed) Supercomputing
Collaborative engineering
high-throughput computing
large scale simulation & parameter studies
Remote software access / Renting Software
Data-intensive computing
On-demand computing
144
Grid Components
[Figure: layered diagram, top to bottom]
Grid Apps. (Applications and Portals): Scientific, Engineering, Collaboration, Prob. Solving Env., Web enabled Apps
Grid Tools (Development Environments and Tools): Languages, Libraries, Debuggers, Monitoring, Resource Brokers, Web tools
Grid Middleware (Distributed Resources Coupling Services): Comm., Sign on & Security, Information, Process, Data Access, QoS
Grid Fabric (Local Resource Managers): Operating Systems, Queuing Systems, Libraries & App Kernels, TCP/IP & UDP
Grid Fabric (Networked Resources across Organisations): Computers, Clusters, Storage Systems, Data Sources, Scientific Instruments
145
Many GRID Projects and Initiatives
] PUBLIC FORUMS
Computing Portals
Grid Forum
European Grid Forum
IEEE TFCC!
GRID2000 and more.
] Australia
Nimrod/G
EcoGrid and GRACE
DISCWorld
] Europe
UNICORE
MOL
METODIS
Globe
Poznan Metacomputing
CERN Data Grid
MetaMPI
DAS
JaWS
and many more...
] Public Grid Initiatives
Distributed.net
SETI@Home
Compute Power Grid
] USA
Globus
Legion
JAVELIN
AppLes
NASA IPG
Condor
Harness
NetSolve
NCSA Workbench
WebFlow
EveryWhere
and many more...
] Japan
Ninf
Bricks
and many more...
http://www.gridcomputing.com/
146
NetSolve
Client/Server/Agent -- Based Computing
Client-Server design
Network-enabled solvers
Seamless access to resources
Non-hierarchical system
Load Balancing
Fault Tolerance
Interfaces to Fortran, C, Java, Matlab, more
Easy-to-use tool to provide efficient and uniform
access to a variety of scientific packages on UNIX platforms
NetSolve Client
NetSolve Agent
Network Resources
Software Repository
Software is available
www.cs.utk.edu/netsolve/
request
choice
reply
147
Host D
Host C
Host B
Host A
Virtual
Machine
Operation within VM uses
Distributed Control
process control
user features
HARNESS daemon
Customization
and extension
by dynamically
adding plug-ins
Component
based daemon
Discovery and registration
Another
VM
HARNESS Virtual Machine
Scalable Distributed control and CCA based Daemon
http://www.epm.ornl.gov/harness/
148
HARNESS Core Research
Parallel Plug-ins for Heterogeneous Distributed Virtual Machine
One research goal is to understand and implement
a dynamic parallel plug-in environment.
provides a method for many users to extend Harness
in much the same way that third party serial plug-ins
extend Netscape, Photoshop, and Linux.
Research issues with Parallel plug-ins include:
heterogeneity, synchronization, interoperation, partial success

(three typical cases):

load plug-in into single host of VM w/o communication
load plug-in into single host broadcast to rest of VM
load plug-in into every host of VM w/ synchronization
149
Nimrod - A Job Management
System
http://www.dgs.monash.edu.au/~davida/nimrod.html
150
Job processing with Nimrod
151
Nimrod/G Architecture
Middleware Services
Nimrod/G Client Nimrod/G Client Nimrod/G Client
Grid Information Services
Schedule Advisor
Trading Manager
Nimrod Engine
GUSTO Test Bed
Persistent
Store
Grid Explorer
GE GIS
TM TS
RM & TS
RM & TS
RM & TS
Dispatcher
RM: Local Resource Manager, TS: Trade Server
152
User
Application
Resource Broker
A Resource Domain
Grid Explorer
Schedule Advisor
Trade Manager
Job
Control
Agent
Deployment Agent
Trade Server
Resource Allocation
Resource
Reservation
R
1

Other
services
Trading
Grid Information Server
R
2
R
n

Charging Alg.
Accounting
Compute Power Market
153

Pointers to Literature on
Cluster Computing
154
Reading Resources..1a
Internet & WWW
Computer Architecture:
http://www.cs.wisc.edu/~arch/www/
PFS & Parallel I/O
http://www.cs.dartmouth.edu/pario/
Linux Parallel Processing
http://yara.ecn.purdue.edu/~pplinux/Sites/
DSMs
http://www.cs.umd.edu/~keleher/dsm.html

155
Reading Resources..1b
Internet & WWW
Solaris-MC
http://www.sunlabs.com/research/solaris-mc
Microprocessors: Recent Advances
http://www.microprocessor.sscc.ru
Beowulf:
http://www.beowulf.org
Metacomputing
http://www.sis.port.ac.uk/~mab/Metacomputing/
156
Reading Resources..2
Books
In Search of Clusters
by G. Pfister, Prentice Hall (2nd ed.), 1998
High Performance Cluster Computing
Volume 1: Architectures and Systems
Volume 2: Programming and Applications
Edited by Rajkumar Buyya, Prentice Hall, NJ, USA.
Scalable Parallel Computing
by K. Hwang & Z. Xu, McGraw-Hill, 1998
157
Reading Resources..3
Journals
A Case for NOW (Networks of Workstations), IEEE Micro, Feb. 1995
by Anderson, Culler, Patterson
Fault Tolerant COW with SSI, IEEE
Concurrency, (to appear)
by Kai Hwang, Chow, Wang, Jin, Xu
Cluster Computing: The Commodity
Supercomputer, Software: Practice
and Experience (available from my web page)
by Mark Baker & Rajkumar Buyya
158
Cluster Computing Infoware
http://www.csse.monash.edu.au/~rajkumar/cluster/
159
Cluster Computing Forum
IEEE Task Force on Cluster Computing
(TFCC)

http://www.ieeetfcc.org
160
TFCC Activities...

] Network Technologies
] OS Technologies
] Parallel I/O
] Programming Environments
] Java Technologies
] Algorithms and Applications
] Analysis and Profiling
] Storage Technologies
] High Throughput Computing
161
TFCC Activities...

] High Availability
] Single System Image
] Performance Evaluation
] Software Engineering
] Education
] Newsletter
] Industrial Wing
] TFCC Regional Activities
All of the above have their own pages; see pointers
from:
http://www.ieeetfcc.org
162
TFCC Activities...
] Mailing list, Workshops, Conferences, Tutorials,
Web-resources etc.

] Resources for introducing subject in senior
undergraduate and graduate levels.
] Tutorials/Workshops at IEEE Chapters..
] .. and so on.
] FREE MEMBERSHIP, please join!
] Visit TFCC Page for more details:
http://www.ieeetfcc.org (updated daily!).
163
Clusters Revisited
164
Summary
We have discussed Clusters
Enabling Technologies
Architecture & its Components
Classifications
Middleware
Single System Image
Representative Systems

165
Conclusions
Clusters are promising..
Solve the parallel processing paradox
Offer incremental growth that matches
funding patterns.
New trends in hardware and software
technologies are likely to make clusters more
promising, so that
cluster-based supercomputers can be seen
everywhere!
166
167

Thank You ...
?
168
Backup Slides...
169
SISD : A Conventional Computer
Speed is limited by the rate at which the computer can
transfer information internally.
[Figure: Instructions and Data Input flow into the Processor, which produces Data Output]
Ex: PC, Macintosh, Workstations
170
The MISD Architecture
More of an intellectual exercise than a practical configuration.
A few were built, but none are commercially available.
Data
Input
Stream
Data
Output
Stream
Processor
A
Processor
B
Processor
C
Instruction
Stream A
Instruction
Stream B
Instruction Stream C
171
SIMD Architecture
Ex: CRAY machine vector processing, Thinking Machines CM*
Ci <= Ai * Bi
Instruction
Stream
Processor
A
Processor
B
Processor
C
Data Input
stream A
Data Input
stream B
Data Input
stream C
Data Output
stream A
Data Output
stream B
Data Output
stream C
172
MIMD Architecture
Unlike SISD and MISD machines, a MIMD computer works asynchronously.
Shared memory (tightly coupled) MIMD
Distributed memory (loosely coupled) MIMD
Processor
A
Processor
B
Processor
C
Data Input
stream A
Data Input
stream B
Data Input
stream C
Data Output
stream A
Data Output
stream B
Data Output
stream C
Instruction
Stream A
Instruction
Stream B
Instruction
Stream C
