
High Performance Parallel Supercomputer
Dien Taufan Lessy (3011464)
MCSCE
Spring Semester 2014
Contents
Introduction
Parallelism
Future Research
Conclusion
References
Literature
Introduction
The first Supercomputer
IBM Naval Ordnance Research Calculator (NORC)
15,000 operations/s
ADD (15 µs), MUL (31 µs), DIV (227 µs)
[Figure: IBM NORC] [1]
The first Supercomputer
Control Data Corporation 6600
1 MFLOPS
[Figure: CDC 6600] [2]
Today (November 2013): Tianhe-2 (MilkyWay-2)
Cores: 3,120,000
Rmax: 33,862.7 TFLOP/s
Power: 17,808 kW
[Figure: Tianhe-2] [3]
Today's Ranking
[Figure] [3]
HPC Vendors
[Figure] [3]
Processor Generation
[Figure] [3]
User Segments
[Figure] [3]
OS
[Figure] [3]
[Figure] [3]
Parallelism
History
[Figure] [3]
Concept and Terminology
The von Neumann Computer
Walk-Through: c=a+b

1. Get next instruction
2. Decode: Fetch a
3. Fetch a to internal register
4. Get next instruction
5. Decode: fetch b
6. Fetch b to internal register
7. Get next instruction
8. Decode: add a and b (c in register)
9. Do the addition in ALU
10. Get next instruction
11. Decode: store c in main memory
12. Move c from internal register to main memory
Note: some units are idle while others are working, wasting cycles.
Pipelining (modularization) and caching (advance decoding) introduce parallelism.
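The twelve steps above can be sketched as a toy fetch-decode-execute loop. This is a minimal illustration, not any real instruction set; the opcode names and two-register design are made up for the example.

```python
def run(program, memory):
    """Toy von Neumann cycle: fetch an instruction, decode it, execute it."""
    acc = 0   # internal register holding the most recently fetched value
    reg = 0   # second internal register (the previous value)
    for opcode, operand in program:          # get next instruction
        if opcode == "LOAD":                 # decode: fetch operand
            reg, acc = acc, memory[operand]  # fetch to internal register
        elif opcode == "ADD":                # decode: add
            acc = acc + reg                  # do the addition in the ALU
        elif opcode == "STORE":              # decode: store result
            memory[operand] = acc            # move register to main memory
    return memory

memory = {"a": 2, "b": 3, "c": 0}
program = [("LOAD", "a"), ("LOAD", "b"), ("ADD", None), ("STORE", "c")]
run(program, memory)   # afterwards memory["c"] == 5
```

Every step goes through the same fetch/decode bottleneck; pipelining overlaps these stages instead of running them strictly one after another.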
How to make programs faster?

Reduce cycle time t_c:
  Increase clock frequency; however, there is a physical limit
  Memory access
Reduce number of instructions n_i:
  More efficient algorithms
  Better compilers
Reduce CPI -- the key is parallelism:
  Instruction-level parallelism: pipelining
  Vector processing
  Internal parallelism: multiple functional units; superscalar processors; multi-core processors (superscalar and VLIW)
  External parallelism: multiple CPUs, parallel machines
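The three levers above correspond to the classic performance equation T = n_i x CPI x t_c (with t_c = 1/f). A small sketch with illustrative numbers only:

```python
def cpu_time(n_i, cpi, clock_hz):
    """T = n_i * CPI * t_c, where t_c = 1 / clock_hz."""
    return n_i * cpi / clock_hz

base = cpu_time(n_i=1e9, cpi=2.0, clock_hz=2e9)   # 1.0 s
half_tc = cpu_time(1e9, 2.0, 4e9)                 # halve t_c: 0.5 s
low_cpi = cpu_time(1e9, 0.5, 2e9)                 # superscalar CPI < 1: 0.25 s
```

Halving t_c and quartering CPI are independent wins, which is why parallelism (lower effective CPI) kept paying off after frequency scaling hit its physical limit.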
Increasing Clock Frequency
Moore's Law
Increasing Clock Frequency
Core voltage increases with frequency
[Figure] [6]
Dien Taufan Lessy (3011464)
High Performace Parallel Supercomputer
Parallelism
Power cost of Frequency
Power ~ Voltage^2 x Frequency
Voltage ~ Frequency
Power ~ Frequency^3
High-performance serial processors need high power
Processor-Memory Gap (Bottleneck)
Definition
Concurrent vs Parallel
"A parallel computer is a collection of processing elements that communicate and cooperate to solve large problems quickly." (Almasi and Gottlieb, 1989)
Speedup vs Efficiency
For a given problem:

speedup(P processors) = exec. time (1 processor) / exec. time (P processors)
efficiency = speedup / P

10 processors with a 2x speedup? Efficiency is only 2/10 = 0.2
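The two definitions above, as a short sketch (timings are made-up example values):

```python
def speedup(t_serial, t_parallel):
    """Ratio of one-processor time to P-processor time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Speedup normalized by processor count: how well each processor is used."""
    return speedup(t_serial, t_parallel) / p

# 10 processors, but the run only went from 10 s to 5 s (2x speedup):
# on average each processor contributed useful work just 20% of the time.
eff = efficiency(10.0, 5.0, 10)   # 0.2
```

Efficiency makes the cost of poor scaling visible: a 2x speedup looks fine until you notice it took 10 machines to get it.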
Serial vs Parallel
[Figures: serial vs parallel execution] [3]
Processor Type

Scalar processors
  CISC: Complex Instruction Set Computer
    Intel 80x86 (IA32)
  RISC: Reduced Instruction Set Computer
    Sun SPARC, IBM Power #, SGI MIPS
  VLIW: Very Long Instruction Word; Explicitly Parallel Instruction Computing (EPIC); probably dying
    Intel IA64 (Itanium)
Vector processors
  Cray X1/T90; NEC SX#; Japan Earth Simulator; early Cray machines; Japan Life Simulator (hybrid)
CISC vs RISC vs VLIW
[Figure] [5]
Flynn's Classical Taxonomy
[Figure] [3]
SISD
[Figure] [3]
SIMD
[Figure] [3]
MISD
[Figure] [3]
MIMD
[Figure] [3]
Memory Architecture
Shared Memory
Superscalar processors with L2 cache connected to memory modules through a bus or crossbar.
All processors have access to all machine resources, including memory and I/O devices.
SMP (symmetric multiprocessor): all processors are the same and have equal access to machine resources, i.e. the machine is symmetric.
SMPs are UMA (Uniform Memory Access) machines.
e.g., a node of an IBM SP machine; Sun Ultra Enterprise 10000
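A minimal sketch of the shared-memory model using Python threads, which share one address space (note that CPython's global interpreter lock means these threads illustrate shared memory, not true parallel execution; all names here are illustrative):

```python
import threading

def worker(shared, idx):
    # Each thread writes directly into the same list: no explicit
    # communication is needed because the memory is shared.
    shared[idx] = idx * idx

shared = [0] * 4
threads = [threading.Thread(target=worker, args=(shared, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# shared is now [0, 1, 4, 9]
```

Each thread writes a distinct slot, so there is no contention here; threads updating the *same* location would need a lock, the software analogue of the bus/crossbar contention described below.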
Memory Architecture
Shared Memory
If a bus:
  Only one processor can access the memory at a time.
  Processors contend for the bus to access memory.
If a crossbar:
  Multiple processors can access memory through independent paths.
  Contention arises when different processors access the same memory module.
  Crossbars can be very expensive.
Processor count is limited by memory contention and bandwidth, usually to at most 64 or 128.
Memory Architecture
Distributed Memory
Superscalar processors with local memory, connected through a communication network.
Each processor can only work on data in its local memory.
Access to remote memory requires explicit communication.
Present-day large supercomputers are all some sort of distributed-memory machine.
Memory Architecture
Hybrid Distributed-Shared Memory
Overall distributed memory, with SMP nodes.
Most modern supercomputers and workstation clusters are of this type.
Programming model: message passing, or hybrid message passing plus threading.
Amdahl's Law
[Figure] [3]
Suppose only part of an application is parallelizable.
Let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable, and let P be the number of processors.

Speedup(P) = Time(1)/Time(P)
           <= 1/(s + (1-s)/P)
           <= 1/s
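The bound above can be computed directly; the serial fractions below are example values, not measurements:

```python
def amdahl_speedup(s, p):
    """Amdahl's upper bound on speedup: serial fraction s, p processors."""
    return 1.0 / (s + (1.0 - s) / p)

# With 10% serial work, 16 processors give only ~6.4x,
# and even unlimited processors cannot beat 1/s = 10x.
sp16 = amdahl_speedup(0.1, 16)
sp_inf = amdahl_speedup(0.1, 10**9)
```

This is why the serial fraction, not the processor count, dominates scaling: shrinking s from 10% to 1% raises the ceiling from 10x to 100x.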
Amdahl's Law
[Figure] [3]
Future Research
Hardware
Addressing the memory wall and inter-node interconnection
Optical computing
Quantum computing
Conclusion
Improving a single instruction stream requires a lot of effort for little gain.
Parallel computing is the only way to achieve higher performance in the foreseeable future.
A supercomputer combines all parallel computing technologies: parallel CPUs, multicore, scalar, vector, etc.
References
[1]: http://www.columbia.edu/cu/computinghistory/norc.html, May 2014
[2]: http://en.wikipedia.org/wiki/File:CDC_6600.jc.jpg
[3]: http://www.top500.org/, May 2014
[4]: https://computing.llnl.gov/tutorials/parallel_comp/
[5]: http://15418.courses.cs.cmu.edu/spring2014/lecture/whyparallelism
[6]: http://www.intel.com
[7]: http://discovermagazine.com/galleries/zen-photo/m/moores-law
[8]: http://people.cs.clemson.edu/~mark/464/acmse_epic.pdf

Literature
Parallel Computer Architecture: A Hardware / Software
Approach, D.E. Culler, J.P. Singh
Computer Architecture: A Quantitative Approach, J.L.
Hennessy, D.A. Patterson
https://computing.llnl.gov/tutorials/parallel_comp/
https://computing.llnl.gov/tutorials/parallel_comp/OverviewRecentSupercomputers.2008.pdf
http://www-users.cs.umn.edu/~karypis/parbook/
http://www.top500.org/
http://15418.courses.cs.cmu.edu