
SISD Single Instruction stream, Single Data stream

Conventional sequential machines. The program executed is a single instruction stream, and the data operated on is a single data stream.

SIMD Single Instruction stream, Multiple Data streams

Vector machines (superscalar). Processors execute the same program, but operate on different data streams.

MIMD Multiple Instruction streams, Multiple Data streams

Parallel machines. Independent processors execute different programs, using unique data streams.

MISD Multiple Instruction streams, Single Data stream

Systolic array machines. A common data structure is manipulated by separate processors, executing different instruction streams (programs).

Anshul Kumar, CSE IITD slide 2

SISD

[Figure: SISD organization. A control unit (C) sends an instruction stream (IS) to a processor (P), which exchanges a data stream (DS) with memory (M).]

SIMD

[Figure: SIMD organization. One control unit (C) broadcasts a single instruction stream (IS) to multiple processors (P), each operating on its own data stream (DS) to and from memory (M).]

MISD

[Figure: MISD organization. Multiple control units (C) each issue their own instruction stream (IS) to a processor (P), while all processors operate on a single data stream (DS) from memory (M).]

MIMD

[Figure: MIMD organization. Multiple control units (C) issue independent instruction streams (IS) to their own processors (P), each operating on its own data stream (DS) to and from memory (M).]

Classification of Parallel Architectures

Parallel architectures (PAs)
- Data-parallel architectures (DPs)
  - Vector architectures
  - Associative and neural architectures
  - SIMDs
  - Systolic architectures
- Function-parallel architectures
  - Instruction-level PAs (ILPs)
    - Pipelined processors
    - VLIWs
    - Superscalar processors
  - Thread-level PAs
  - Process-level PAs (MIMDs)
    - Distributed-memory MIMD
    - Shared-memory MIMD

[Ref: Sima et al.]

What is Pipelining

A technique used in advanced microprocessors where the microprocessor begins executing a second instruction before the first has been completed.

- A pipeline is a series of stages, where some work is done at each stage. The work is not finished until it has passed through all stages.

With pipelining, the computer architecture allows the next instructions to be fetched while the processor is performing arithmetic operations, holding them in a buffer close to the processor until each instruction operation can be performed.

Four Pipelined Instructions

[Figure: four instructions flowing through the five pipeline stages IF, ID, EX, M, W, each instruction starting one cycle after the previous one. The first instruction completes after 5 cycles; each subsequent instruction completes 1 cycle later.]

Instruction Fetch

The Instruction Fetch (IF) stage is responsible for obtaining the requested instruction from memory. The instruction and the program counter (which is incremented to the next instruction) are stored in the IF/ID pipeline register as temporary storage so that they may be used in the next stage at the start of the next clock cycle.

Instruction Decode

The Instruction Decode (ID) stage is responsible for decoding

the instruction and sending out the various control lines to

the other parts of the processor. The instruction is sent to the

control unit where it is decoded and the registers are fetched

from the register file.

Memory and IO

The Memory and IO (MEM) stage is responsible for storing and loading values to and from memory. It is also responsible for input to or output from the processor. If the current instruction is not of Memory or IO type, then the result from the ALU is passed through to the write-back stage.

DATA FLOW COMPUTERS

- Data flow computers execute the instructions as the data becomes available.
- Data flow architectures are highly asynchronous.
- In the data flow architecture there is no need to store intermediate or final results, because they are passed as tokens among instructions.
- The program sequencing depends on the data availability.
- The information appears as operation packets and data tokens:
  - Operation packet = opcode + operands + destination
  - Data token = data (result) + destination
- Data flow computers have a packet communication architecture.
- Data flow computers have a distributed multiprocessor organization.

Copyright 2004 David J. Lilja 15

What is a performance metric?

- Count: of how many times an event occurs
- Duration: of a time interval
- Size: of some parameter
- A value derived from these fundamental measurements


Good metrics are:

- Linear: nice, but not necessary
- Reliable: required
- Repeatable: required
- Easy to use: nice, but not necessary
- Consistent: required
- Independent: required

III. PERFORMANCE METRICS

A performance metric is a measure of a system's performance. It focuses on measuring a certain aspect of the system and allows comparison of various types of systems. The criteria for evaluating performance in parallel computing can include speedup, efficiency and scalability.

Speedup

Speedup is the most basic of parameters in multiprocessing systems and shows how much faster a parallel algorithm is than a sequential algorithm. It is defined as follows:

Sp = T1/Tp

where Sp is the speedup, T1 is the execution time for a sequential algorithm, Tp is the execution time for a parallel algorithm, and p is the number of processors.

There are three possibilities for speedup: linear, sub-linear and super-linear, shown in Figure 1. When Sp = p, i.e. when the speedup is equal to the number of processors, the speedup is called linear. In such a case, doubling the number of processors will double the speedup. In the case of sub-linear speedup, the speedup grows more slowly than the number of processors. Most algorithms are sub-linear, because of the various overheads associated with multiple processors: interprocessor communication, load imbalance, synchronization, and extra computation. An interesting case is super-linear speedup, which is mainly due to the increase in aggregate cache size.

Efficiency

Another performance metric in parallel computing is efficiency. It is defined as the achieved fraction of the total potential parallel processing gain, and estimates how well the processors are used in solving the problem:

Ep = Sp/p = T1/(p*Tp)

where Ep is the efficiency.
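The two formulas can be sketched directly. A minimal example, where all timings below are made up for illustration only:

```python
def speedup(t1, tp):
    """Sp = T1 / Tp: sequential execution time over parallel execution time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Ep = Sp / p = T1 / (p * Tp)."""
    return t1 / (p * tp)

# Hypothetical timings: a 120 s sequential run reduced to 20 s on 8 processors.
t1, tp, p = 120.0, 20.0, 8
print(speedup(t1, tp))        # 6.0  -> sub-linear (less than p = 8)
print(efficiency(t1, tp, p))  # 0.75 -> 75% of the ideal linear gain
```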

The Random Access Machine Model

RAM model of serial computers:

- Memory is a sequence of words, each capable of containing an integer.
- Each memory access takes one unit of time.
- Basic operations (add, multiply, compare) take one unit of time.
- Instructions are not modifiable.
- Read-only input tape, write-only output tape.

7.2.2 The PRAM model

The PRAM is an idealized parallel machine which was developed as a

straightforward generalization of the sequential RAM. Because we will be

using it often, we give here a detailed description of it.

Description

A PRAM uses p identical processors, each one with a distinct id-number, and

able to perform the usual computation of a sequential RAM that is equipped

with a finite amount of local memory. The processors communicate through

some shared global memory (Figure 7.4) to which all are connected. The

shared memory contains a finite number of memory cells. There is a global clock that sets the pace of the machine execution. In one time-unit period each processor can perform, if it so wishes, any or all of the following three steps:

1. Read from a memory location, global or local;

2. Execute a single RAM operation, and

3. Write to a memory location, global or local.

[Figure 7.4: PRAM organization. Processors P1, P2, ..., Pn, each with its own private memory, are connected through an interconnection network to a shared global memory, under a common control.]

Advanced Topics in Algorithms and Data Structures

Classification of the PRAM model

The power of a PRAM depends on the kind of access to the shared memory locations.

In every clock cycle:

- In the Exclusive Read Exclusive Write (EREW) PRAM, each memory location can be accessed only by one processor.
- In the Concurrent Read Exclusive Write (CREW) PRAM, multiple processors can read from the same memory location, but only one processor can write.
- In the Concurrent Read Concurrent Write (CRCW) PRAM, multiple processors can read from or write to the same memory location.

It is easy to allow concurrent reading. However, concurrent writing gives rise to conflicts. If multiple processors write to the same memory location simultaneously, it is not clear what is written to the memory location.

- In the Common CRCW PRAM, all the processors must write the same value.
- In the Arbitrary CRCW PRAM, one of the processors arbitrarily succeeds in writing.
- In the Priority CRCW PRAM, processors have priorities associated with them and the highest-priority processor succeeds in writing.

The EREW PRAM is the weakest and the Priority CRCW PRAM is the strongest PRAM model. The relative powers of the different PRAM models are as follows:

- An algorithm designed for a weaker model can be executed within the same time and work complexities on a stronger model.
- We say model A is less powerful compared to model B if either: the time complexity for solving a problem is asymptotically less in model B as compared to model A; or, if the time complexities are the same, the processor or work complexity is asymptotically less in model B as compared to model A.
- An algorithm designed for a stronger PRAM model can be simulated on a weaker model either with asymptotically more processors (work) or with asymptotically more time.

Adding n numbers on a PRAM

[Figure: adding n numbers on a PRAM.]

This algorithm works on the EREW PRAM model as there are no read or write conflicts. We will use this algorithm to design a matrix multiplication algorithm on the EREW PRAM.

PRAMs are classified as EREW, CREW and CRCW.

EREW. In the exclusive-read-exclusive-write (EREW) PRAM model, no conflicts are permitted for either reading or writing. If, during the execution of a program on this model, some conflict occurs, the program's behavior is undefined.

CREW. In the concurrent-read-exclusive-write (CREW) PRAM model, simultaneous readings by many processors from some memory cell are permitted. However, if a writing conflict occurs, the behavior of the program is undefined. The idea behind this model is that it may be cheap to implement some broadcasting primitive on a real machine, so one needs to examine the usefulness of such a decision.

CRCW. Finally, the concurrent-read-concurrent-write (CRCW) PRAM, the strongest of these models, permits simultaneous accesses for both reading and writing. In the case of multiple processors trying to write to some memory cell, one must define which of the processors eventually does the writing. There are several answers that researchers have given to this question; the most commonly used are the Common, Arbitrary and Priority rules described above.

Theorem 8. Any algorithm written for the PRIORITY CRCW model can be simulated on the EREW model with a slowdown of O(lg r), where r is the number of processors employed by the algorithm.

Proof. We have an algorithm that runs correctly on a PRIORITY CRCW PRAM, and an EREW PRAM machine on which we want to simulate the algorithm. If we were to execute the algorithm without modification on the EREW machine, it would not work. The problem would not be in the executable part of the code, since both machines understand the same set of executable statements. Instead, the problem would be in the statements that access the shared memory for reading or writing. For example, every time the algorithm says:

    Processor Pi reads from memory location y into x

it might involve a concurrent reading which the EREW machine cannot handle. The same is true for a concurrent write statement (depicted visually in figure 7.8). In order to fix the problem, we have to simulate each statement of that sort with a sequence of statements that do not involve any concurrency, but have the same result as if we had concurrency.

Let us assume that the algorithm uses r processors, named P1, P2, ..., Pr; the EREW machine also uses r processors. The EREW machine, however, will need a little more memory: r auxiliary memory locations A[1..r], which will be used to resolve the conflicts.

The idea is to replace each code fragment of the algorithm:

    Processor Pi accesses (reads into x or writes x into) memory location y

with code which:

a) has the processors request permission to access a particular memory location,
b) finds out, for every memory location, whether there is a conflict, and
c) decides which of the competing processors will do the access.

This is achieved by the following fragment of code:

1. Processor Pi writes (y, i) into A[i].
2. Auxiliary array A[1..r] is sorted lexicographically in increasing order.
3. Processor Pi reads A[i-1] and A[i] and determines whether it is the highest-priority processor accessing some memory location.
4. If Pi is the highest-priority processor, then:
   - if the operation was a write, Pi does the writing;
   - else Pi does the reading, and the value read is propagated to the processors interested in this value.

The last step takes O(lg r) time. The sorting step also takes O(lg r), as the following non-trivial fact shows. (We mention it here without proof. For a proof, consult [?].)
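The four-step resolution above can be simulated sequentially. A sketch with hypothetical helper names; it reproduces the sorting-based priority resolution, not the O(lg r) parallel schedule:

```python
def resolve_priority_writes(requests):
    """Simulate the conflict-resolution idea from the proof.

    `requests` maps processor id -> (location, value); lower ids have
    higher priority. We sort (location, id) pairs, as the EREW
    simulation does, and keep only the first request per location.
    Returns the resulting memory updates {location: value}.
    """
    # Steps 1-2: build and sort the auxiliary array of (y, i) pairs.
    aux = sorted((loc, pid) for pid, (loc, _val) in requests.items())
    winners = {}
    # Steps 3-4: the first entry for each location belongs to the
    # highest-priority processor; it alone performs the access.
    for loc, pid in aux:
        if loc not in winners:
            winners[loc] = pid
    return {loc: requests[pid][1] for loc, pid in winners.items()}

# Processors 0, 2 and 5 all target location 10; processor 0 wins by priority.
reqs = {0: (10, 'a'), 2: (10, 'b'), 5: (10, 'c'), 3: (11, 'd')}
print(resolve_priority_writes(reqs))   # {10: 'a', 11: 'd'}
```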

Prefix Sum (Doubling) - CREW PRAM

Given: n elements in A[0..n-1]
Var: A & j are global, i is local

spawn (P1, P2, ..., P(n-1))    // note the number of processors
for all Pi, 1 <= i <= n-1
    for j = 0 to log n - 1 do
        if (i - 2^j >= 0)
            A[i] = A[i] + A[i - 2^j]
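A sequential simulation of the doubling step may help. This sketch copies the array before each step to imitate the lockstep CREW reads (the function name is ours):

```python
import math

def prefix_sum_doubling(a):
    """Simulate the CREW doubling algorithm sequentially.

    At step j every position i with i - 2**j >= 0 adds A[i - 2**j]
    into A[i]; taking all reads from a snapshot imitates the
    processors acting in lockstep.
    """
    a = list(a)
    n = len(a)
    steps = math.ceil(math.log2(n)) if n > 1 else 0
    for j in range(steps):
        old = list(a)                  # all reads happen before writes
        for i in range(n):
            if i - 2**j >= 0:
                a[i] = old[i] + old[i - 2**j]
    return a

print(prefix_sum_doubling([1, 2, 3, 4]))   # [1, 3, 6, 10]
```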

Sum of elements - EREW PRAM

Given: n elements in A[0..n-1]
Var: A & j are global, i is local

spawn (P0, P1, P2, ..., P(n/2-1))    // P = n/2
for all Pi, 0 <= i <= n/2 - 1
    for j = 0 to log n - 1 do
        if (i mod 2^j = 0) and (2i + 2^j < n)
            A[2i] = A[2i] + A[2i + 2^j]
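The EREW reduction can likewise be simulated sequentially; the total ends up in A[0]. A sketch, with our own helper name:

```python
import math

def erew_sum(a):
    """Simulate the EREW pairwise reduction; the total ends in A[0].

    At step j, processor i (with i mod 2**j == 0, as in the
    pseudocode) adds A[2i + 2**j] into A[2i]. No location is read or
    written by two processors in the same step, so no snapshot copy
    is needed.
    """
    a = list(a)
    n = len(a)
    steps = math.ceil(math.log2(n)) if n > 1 else 0
    for j in range(steps):
        for i in range(n // 2):
            if i % 2**j == 0 and 2*i + 2**j < n:
                a[2*i] = a[2*i] + a[2*i + 2**j]
    return a[0]

print(erew_sum([1, 2, 3, 4, 5]))   # 15
```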

Algorithm

Let A and B be the given shared arrays of r and s elements respectively, sorted in nondecreasing order. It is required to merge them into a shared array C. As presented by Akl [4], the algorithm is as follows. Let P1, P2, ..., PN be the N processors available.

Step 1: The algorithm selects N-1 elements from array A for a shared array A'. This divides A into N approximately equal-size segments. A shared array B' of N-1 elements of B is chosen similarly. For this step, each Pi inserts A[i * r/N] and B[i * s/N] in parallel into the ith location of A' and B' respectively.

Step 2: This step merges A' and B' into a shared array V of size 2N-2. Each element v of V is a triple consisting of an element of A' or B', followed by its position in A' or B', followed by the name A or B. For this step each Pi:

a. Using sequential BINARY SEARCH, searches the array B' in parallel to find the smallest j such that A'[i] < B'[j]. If such a j exists, then V[i + j - 1] is set to the triple (A'[i], i, A); otherwise V[i + N - 1] is set to the triple (A'[i], i, A).
b. Using sequential BINARY SEARCH, searches the array A' to find the smallest j such that B'[i] < A'[j]. If such a j exists, then V[i + j - 1] is set to the triple (B'[i], i, B); otherwise V[i + N - 1] is set to the triple (B'[i], i, B).

Step 3: To merge A and B into the shared array C, the indices of the two elements (one in A and one in B) at which each processor is to begin merging are computed in a shared array Q of ordered pairs. This step is executed as follows:

a. P1 sets Q[1] to (1, 1).
b. Each Pi, i >= 2, checks whether V[2i-2] is equal to (A'[k], k, A) or not. If it is equal, then Pi searches B using BINARY SEARCH to find the smallest j such that B[j] > A'[k], and sets Q[i] to (k * r/N, j); otherwise Pi searches A using BINARY SEARCH to find the smallest j such that A[j] > B'[k], and sets Q[i] to (j, k * s/N).

Step 4: Each Pi, i < N, uses the sequential merge and Q[i] = (x, y) and Q[i+1] = (u, v) to merge the two subarrays A[x..u-1] and B[y..v-1], and places the result of the merge in array C at position x + y - 1. Processor PN uses Q[N] = (w, z) to merge the two subarrays A[w..r] and B[z..s].
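The partition-then-merge idea of Steps 1-4 can be sketched as follows. This is a simplified, merge-path-style sketch (cut points chosen in A only), not Akl's exact steps:

```python
from bisect import bisect_left

def merge_segments(a, b):
    """Stable sequential merge, taking from `a` on ties."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

def partitioned_merge(A, B, N):
    """Merge sorted A and B as N independent segment merges.

    For each cut point in A, a binary search finds the matching cut
    in B, so the N segment merges could run on N processors with no
    overlap; here they run one after another.
    """
    r, s = len(A), len(B)
    cuts = [(0, 0)]
    for i in range(1, N):
        x = i * r // N
        y = bisect_left(B, A[x]) if x < r else s
        cuts.append((x, y))
    cuts.append((r, s))
    C = []
    for (x0, y0), (x1, y1) in zip(cuts, cuts[1:]):
        C.extend(merge_segments(A[x0:x1], B[y0:y1]))
    return C

print(partitioned_merge([1, 3, 5, 7], [2, 4, 6, 8], 3))
# [1, 2, 3, 4, 5, 6, 7, 8]
```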

List ranking - EREW algorithm

LIST-RANK(L) (in O(lg n) time)
1. for each processor i, in parallel
2.     do if next[i] = nil
3.         then d[i] <- 0
4.         else d[i] <- 1
5. while there exists an object i such that next[i] != nil
6.     do for each processor i, in parallel
7.         do if next[i] != nil
8.             then d[i] <- d[i] + d[next[i]]
9.                  next[i] <- next[next[i]]
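LIST-RANK can be simulated sequentially. This sketch copies d and next before each round to imitate the synchronous EREW steps (names are ours):

```python
def list_rank(next_):
    """Pointer-jumping list ranking (sequential simulation).

    next_[i] is the successor index, or None for the tail. Returns d,
    where d[i] is the number of hops from node i to the tail. Each
    round reads snapshots of d and next, imitating one synchronous
    parallel step; the loop runs O(log n) times.
    """
    n = len(next_)
    nxt = list(next_)
    d = [0 if nxt[i] is None else 1 for i in range(n)]
    while any(x is not None for x in nxt):
        old_d, old_nxt = list(d), list(nxt)
        for i in range(n):
            if old_nxt[i] is not None:
                d[i] = old_d[i] + old_d[old_nxt[i]]
                nxt[i] = old_nxt[old_nxt[i]]
    return d

# List 0 -> 1 -> 2 -> 3 (tail); ranks count hops to the tail.
print(list_rank([1, 2, 3, None]))   # [3, 2, 1, 0]
```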

List-ranking EREW algorithm

[Figure: pointer jumping on a six-element list. (a) Initially d[i] = 1 for every node except the tail, where d[i] = 0. (b)-(d) After each pointer-jumping step the pointers skip twice as far, and the d values converge to the final ranks 5, 4, 3, 2, 1, 0.]

Applications of List Ranking

- Expression tree evaluation
- Parentheses matching
- Tree traversals
- Ear decomposition of graphs
- Euler tour of trees

Graph coloring

Determining whether the vertices of a graph can be colored with c colors, so that no two adjacent vertices are assigned the same color, is called the graph coloring problem. To solve the problem quickly, we can create a processor for every possible coloring of the graph; each processor then checks whether the coloring it represents is valid.
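The brute-force idea above can be sketched with one loop iteration standing in for each would-be processor:

```python
from itertools import product

def is_colorable(n, edges, c):
    """Check all c**n colorings of an n-vertex graph.

    Each candidate coloring plays the role of one processor, which
    checks whether its own coloring has no edge with equal endpoint
    colors.
    """
    for coloring in product(range(c), repeat=n):
        if all(coloring[u] != coloring[v] for u, v in edges):
            return True
    return False

triangle = [(0, 1), (1, 2), (0, 2)]
print(is_colorable(3, triangle, 2))   # False: a triangle needs 3 colors
print(is_colorable(3, triangle, 3))   # True
```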


Linear Arrays and Rings

Linear Array
- Asymmetric network
- Degree d = 2
- Diameter D = N-1
- Bisection bandwidth b = 1
- Allows different sections of the channel to be used by different sources concurrently.

Ring
- d = 2
- D = N-1 for a unidirectional ring, or D = N/2 for a bidirectional ring

[Figure: a linear array, a ring, and a ring arranged to use short wires.]

Fully Connected Topology
- Needs N(N-1)/2 links to connect N processor nodes.
- Example: N=16 -> 120 links; N=1,024 -> 523,776 links
- D = 1
- d = N-1

Chordal ring
- Example: N=16, d=3 -> D=5

Multidimensional Meshes and Tori

Mesh
- Popular topology, particularly for SIMD architectures, since meshes match many data-parallel applications (e.g. image processing, weather forecasting).
- Examples: Illiac IV, Goodyear MPP, CM-2, Intel Paragon
- Asymmetric
- d = 2k, except at boundary nodes
- A k-dimensional mesh has N = n^k nodes.

Torus
- A mesh with looping connections at the boundaries to provide symmetry.

[Figure: a 2D grid and a 3D cube.]

Trees
- Diameter and average distance are logarithmic
- k-ary tree, height d = log_k N
- Address specified as a d-vector of radix-k coordinates describing the path down from the root
- Fixed degree
- Route up to the common ancestor and down
- Bisection BW?

Trees (cont.)

Fat tree
- The channel width increases as we go up
- Solves the bottleneck problem toward the root

Star
- Two-level tree with d = N-1, D = 2
- Centralized supervisor node

Hypercubes
- Each PE is connected to d = log N other PEs
- Binary labels of neighbor PEs differ in only one bit
- A d-dimensional hypercube can be partitioned into two (d-1)-dimensional hypercubes
- The distance between Pi and Pj in a hypercube is the number of bit positions in which i and j differ (i.e. the Hamming distance)
- Example: 10011 XOR 01001 = 11010, so the distance between PE19 and PE9 is 3

[Figure: hypercubes of dimension 0 through 5; the 3-D cube's nodes are labeled 000 through 111.]

*From Parallel Computer Architecture: A Hardware/Software Approach, D. E. Culler

Hypercube routing functions

Example: consider a 4-D hypercube (n = 4), with source address s = 0110 and destination address d = 1101.

Direction bits: r = s XOR d = 0110 XOR 1101 = 1011

1. Route from 0110 to 0111, because bit 1 of r is set
2. Route from 0111 to 0101, because bit 2 of r is set
3. Skip dimension 3, because bit 3 of r is 0
4. Route from 0101 to 1101, because bit 4 of r is set
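The routing rule above can be sketched as dimension-order (e-cube) routing; the function name is ours:

```python
def ecube_route(s, d, n):
    """Dimension-order routing in an n-dimensional hypercube.

    Correct one differing bit per hop, lowest dimension first; the hop
    count equals the Hamming distance between s and d.
    """
    r = s ^ d                      # direction bits
    path, node = [s], s
    for dim in range(n):
        if (r >> dim) & 1:         # this dimension differs: flip it
            node ^= 1 << dim
            path.append(node)
    return path

# The example above: s = 0110, d = 1101 in a 4-D hypercube.
path = ecube_route(0b0110, 0b1101, 4)
print([format(x, '04b') for x in path])   # ['0110', '0111', '0101', '1101']
```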

k-ary n-cubes
- Rings, meshes, tori and hypercubes are special cases of a general topology called a k-ary n-cube
- Has n dimensions with k nodes along each dimension
- An n-processor ring is an n-ary 1-cube
- An n x n mesh is an n-ary 2-cube (without end-around connections)
- An n-dimensional hypercube is a 2-ary n-cube
- N = k^n
- Routing distance is minimized for topologies with higher dimension
- Cost is lowest for lower dimension; scalability is also greatest and VLSI layout is easiest.

Cube-connected cycles
- d = 3
- D = 2k - 1 + floor(k/2)
- Example: N = 8; we can use the 2-CCC network

Network properties
- Node degree d: the number of edges incident on a node (in-degree and out-degree).
- Diameter D of a network: the maximum shortest path between any two nodes.
- The network is symmetric if it looks the same from any node.
- The network is scalable if it is expandable with scalable performance when the machine resources are increased.

Bisection width

Bisection width is the minimum number of wires that must be cut to divide the network into two equal halves.
- A small bisection width implies low bandwidth.
- A large bisection width implies a lot of extra wires.

A cut of a network, C(N1, N2), is a set of channels that partitions the set of all nodes into two disjoint sets N1 and N2. Each element of C(N1, N2) is a channel with a source in N1 and destination in N2, or vice versa.

A bisection of a network is a cut that partitions the entire network nearly in half, such that |N2| <= |N1| <= |N2| + 1. Here |N2| means the number of nodes that belong to partition N2.

The channel bisection of a network is the minimum channel count over all bisections of the network:

    Bc = min over all bisections of |C(N1, N2)|

Factors Affecting Performance
- Functionality: how the network supports data routing, interrupt handling, synchronization, request/message combining, and coherence
- Network latency: worst-case time for a unit message to be transferred
- Bandwidth: maximum data rate
- Hardware complexity: implementation costs for wire, logic, switches, connectors, etc.

2 x 2 Switches

*From Advanced Computer Architectures, K. Hwang, 1993.

Switches

Module size | Legitimate states | Permutation connections
2 x 2       | 4                 | 2
4 x 4       | 256               | 24
8 x 8       | 16,777,216        | 40,320
N x N       | N^N               | N!

- Permutation connection: each input can be connected to only a single output.
- Legitimate state: each input can be connected to multiple outputs, but each output can only be connected to a single input.

Single-stage networks

[Figure: a single-stage Shuffle-Exchange interconnection network (left) and the perfect-shuffle mapping function (right).]

- Perfect shuffle operation: cyclic shift 1 place left, e.g. 101 --> 011
- Exchange operation: invert the least significant bit, e.g. 101 --> 100

*From Ben Macey at http://www.ee.uwa.edu.au/~maceyb/aca319-2003

Multistage Interconnection Networks

The capability of single-stage networks is limited, but if we cascade enough of them together, they form a completely connected MIN (Multistage Interconnection Network). Switches can perform their own routing or can be controlled by a central router.

This type of network can be classified into the following categories:

- Nonblocking: a network is called strictly nonblocking if it can connect any idle input to any idle output regardless of what other connections are currently in process.
- Rearrangeable nonblocking: in this case a network should be able to establish all possible connections between inputs and outputs by rearranging its existing connections.
- Blocking: a network is said to be blocking if it can perform many, but not all, possible connections between terminals. Example: the Omega network.

Omega networks

- A multistage IN using 2 x 2 switch boxes and a perfect-shuffle interconnect pattern between the stages
- In the Omega MIN there is one unique path from each input to each output.
- No redundant paths, hence no fault tolerance, and the possibility of blocking

Example: connect input 101 to output 001. Use the bits of the destination address, 001, to dynamically select a path.

Routing:
- 0 means use the upper output
- 1 means use the lower output

*From Ben Macey at http://www.ee.uwa.edu.au/~maceyb/aca319-2003
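Destination-tag routing can be simulated on the node labels. A sketch, assuming the shuffle-then-exchange stage structure described above (the function name is ours):

```python
def omega_path(src, dst, n_bits):
    """Destination-tag routing through an Omega network.

    At each of the n_bits stages: perfect shuffle (rotate the label
    left), then exchange so the lowest bit matches the next
    destination bit (0 = upper output, 1 = lower output).
    """
    mask = (1 << n_bits) - 1
    node, path = src, [src]
    for i in range(n_bits - 1, -1, -1):
        node = ((node << 1) | (node >> (n_bits - 1))) & mask  # shuffle
        node = (node & ~1) | ((dst >> i) & 1)                 # switch setting
        path.append(node)
    return path

# The example above: input 101 to output 001 in an 8-port network.
print([format(x, '03b') for x in omega_path(0b101, 0b001, 3)])
# ['101', '010', '100', '001']
```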

Omega networks

- log2 N stages of 2 x 2 switches
- N/2 switches per stage
- S = (N/2) log2 N switches in total
- Number of permutations in an Omega network: 2^S

Baseline networks

- The network can be generated recursively.
- The first stage is N x N, the second (N/2) x (N/2).
- Networks are topologically equivalent if one network can be easily reproduced from the other network by simply rearranging nodes at each stage.

*From Advanced Computer Architectures, K. Hwang, 1993.

Crossbar Network

- Each junction is a switching component connecting the row to the column.
- Can only have one connection in each column.

*From Advanced Computer Architectures, K. Hwang, 1993.

Crossbar Network

- The major advantage of the crossbar switch is its potential for speed: in one clock, a connection can be made between source and destination.
- The diameter of the crossbar is one.
- Blocking occurs if the destination is in use.
- Because of its complexity, the cost of the crossbar switch can become the dominant factor for a large multiprocessor system.
- Crossbars can be used to implement the a x b switches used in MINs. In this case each crossbar is small, so costs are kept down.

Performance Comparison

Network  | Latency          | Switching complexity | Wiring complexity | Blocking
Bus      | Constant to O(N) | O(1)                 | O(w)              | yes
MIN      | O(log2 N)        | O(N log2 N)          | O(N w log2 N)     | yes
Crossbar | O(1)             | O(N^2)               | O(N^2 w)          | no

PCAM Algorithm Design

- Partitioning: computation and data are decomposed.
- Communication: coordinate task execution.
- Agglomeration: combine tasks for performance.
- Mapping: assign tasks to processors.

Partitioning

Ignore the number of processors and the

target architecture.

Expose opportunities for parallelism.

Divide up both the computation and data

Can take two approaches

domain decomposition

functional decomposition

Domain Decomposition

Start algorithm design by analyzing the data

Divide the data into small pieces

Approximately equal in size

Then partition the computation by associating

it with the data.

Communication issues may arise as one task

needs the data from another task.

Functional Decomposition

Focus on the computation

Divide the computation into disjoint tasks

Avoid data dependency among tasks

After dividing the computation, examine the

data requirements of each task.

Functional Decomposition

Not as natural as domain

decomposition

Consider search problems

Often functional

decomposition is very useful at

a higher level.

Climate modeling

Ocean simulation

Hydrology

Atmosphere, etc.

Communication

The information flow between tasks is

specified in this stage of the design

Remember:

Tasks execute concurrently.

Data dependencies may limit concurrency.

Communication

Define Channel

Link the producers with the consumers.

Consider the costs

Intellectual

Physical

Distribute the communication.

Specify the messages that are sent.

Communication Patterns

Local vs. Global

Structured vs. Unstructured

Static vs. Dynamic

Synchronous vs. Asynchronous

Local Communication

Communication within a neighborhood.

Algorithm choice determines communication.

Global Communication

Not localized.

Examples

All-to-All

Master-Worker


Structured Communication

Each task's communication resembles every other task's communication. Is there a pattern?

Unstructured Communication

No regular pattern that can

be exploited.

Examples

Unstructured Grid

Resolution changes

Complicates the next stages of design

Synchronous Communication

Both consumers and producers are aware

when communication is required

Explicit and simple


Asynchronous Communication

Timing of send/receive is unknown.

No pattern

Consider: very large data structure

Distribute among computational tasks (polling)

Define a set of read/write tasks

Shared Memory

Agglomeration

- The Partitioning and Communication steps were abstract; Agglomeration moves the design to the concrete.
- Combine tasks so they execute efficiently on some parallel computer.
- Consider replication.

Mapping

Specify where each task is to operate.

Mapping may need to change depending on

the target architecture.

Mapping is NP-complete.

Mapping

Goal: Reduce Execution Time

Concurrent tasks ---> Different processors

High communication ---> Same processor

Mapping is a game of trade-offs.

Mapping

Many domain-decomposition problems make

mapping easy.

Grids

Arrays

etc.

Speedup in Simplest Terms

Speedup = sequential execution time / parallel execution time

Quinn's notation for speedup is ψ(n,p), for data size n and p processors.

Linear Speedup Usually Optimal

Speedup is linear if S(n) = Θ(n).

Theorem: The maximum possible speedup for parallel computers with n PEs for traditional problems is n.

Proof:
- Assume a computation is partitioned perfectly into n processes of equal duration.
- Assume no overhead is incurred as a result of this partitioning of the computation (e.g., the partitioning process, information passing, coordination of processes, etc.).
- Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation: the parallel running time is ts/n.
- Then the parallel speedup of this computation is S(n) = ts/(ts/n) = n.

Linear Speedup Usually Optimal (cont.)

- We shall see later that this proof is not valid for certain types of nontraditional problems.
- Unfortunately, the best speedup possible for most applications is much smaller than n:
  - The optimal performance assumed in the last proof is unattainable.
  - Usually some parts of programs are sequential and allow only one PE to be active.
  - Sometimes a large number of processors are idle for certain portions of the program.
  - During parts of the execution, many PEs may be waiting to receive or to send data (e.g., recall that blocking can occur in message passing).

Superlinear Speedup

- Superlinear speedup occurs when S(n) > n.
- Most texts besides Akl's and Quinn's argue that linear speedup is the maximum speedup obtainable; the preceding proof is used to argue that superlinearity is always impossible.
- Occasionally speedup that appears to be superlinear may occur, but can be explained by other reasons, such as:
  - the extra memory in the parallel system,
  - a sub-optimal sequential algorithm being used,
  - luck, in the case of an algorithm that has a random aspect in its design (e.g., random selection).

Superlinearity (cont.)

Selim Akl has given a multitude of examples that establish that superlinear algorithms are required for many nonstandard problems.

- If a problem either cannot be solved, or cannot be solved in the required time, without the use of parallel computation, it seems fair to say that ts = ∞.
- Since for a fixed tp > 0, S(n) = ts/tp is greater than 1 for all sufficiently large values of ts, it seems reasonable to consider these solutions to be superlinear.

Examples include nonstandard problems involving:
- Real-time requirements, where meeting deadlines is part of the problem requirements.
- Problems where all data is not initially available, but has to be processed after it arrives.
- Real-life situations, such as a person who can only keep a driveway open during a severe snowstorm with the help of friends.
- Some problems are natural to solve using parallelism, and sequential solutions are inefficient.

Superlinearity (cont)

The last chapter of Akls textbook and several journal

papers by Akl were written to establish that

superlinearity can occur.

It may still be a long time before the possibility of

superlinearity occurring is fully accepted.

Superlinearity has long been a hotly debated topic and is

unlikely to be widely accepted quickly.

For more details on superlinearity, see [2] Parallel

Computation: Models and Methods, Selim Akl, pgs 14-20

(Speedup Folklore Theorem) and Chapter 12.

This material is covered in more detail in my PDA class.

97

Speedup Analysis

Recall speedup definition: ψ(n,p) = t_s / t_p

A bound on the maximum speedup is given by

ψ(n,p) ≤ [σ(n) + φ(n)] / [σ(n) + φ(n)/p + κ(n,p)]

Inherently sequential computations are σ(n)

Potentially parallel computations are φ(n)

Communication operations are κ(n,p)

The bound above is due to the assumption in the formula

that the speedup of the parallel portion of the computation

will be exactly p.

Note κ(n,p) = 0 for SIMDs, since communication steps are

usually included with computation steps.

98
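As a rough illustration, the bound above can be computed directly. Only the formula itself comes from the slide; the cost functions below are hypothetical placeholders chosen for the example.

```python
import math

# Upper bound on speedup:
#   psi(n,p) <= (sigma(n) + phi(n)) / (sigma(n) + phi(n)/p + kappa(n,p))
def speedup_bound(n, p, sigma, phi, kappa):
    return (sigma(n) + phi(n)) / (sigma(n) + phi(n) / p + kappa(n, p))

# Hypothetical cost functions, for illustration only.
sigma = lambda n: float(n)                   # inherently sequential work
phi = lambda n: n * math.log2(n)             # parallelizable work
kappa = lambda n, p: (n / p) * math.log2(p)  # communication cost

print(speedup_bound(10_000, 8, sigma, phi, kappa))
```

With σ(n) = 0 and κ(n,p) = 0 the bound reduces to exactly p, matching the assumption stated above that the parallel portion speeds up by exactly p.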

Execution time for parallel portion: φ(n)/p

Shows a nontrivial parallel algorithm's computation

component as a decreasing function of the

number of processors used.

[Plot: time vs. processors, a decreasing curve]

Time for communication: κ(n,p)

Shows a nontrivial parallel algorithm's

communication component as an increasing

function of the number of processors.

[Plot: time vs. processors, an increasing curve]

100

Execution Time of Parallel Portion

φ(n)/p + κ(n,p)

Combining these, we see for a fixed problem size,

there is an optimum number of processors that

minimizes overall execution time.

[Plot: time vs. processors, a U-shaped curve]

101
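The optimum can be found by brute force. Everything numeric here, the cost functions and constants, is invented for illustration; the slide only asserts that such a minimum exists.

```python
import math

def parallel_time(n, p, phi, kappa):
    # time for the parallel portion: phi(n)/p + kappa(n,p)
    return phi(n) / p + kappa(n, p)

# Hypothetical costs: n log n parallelizable work, and a
# communication cost that grows with the processor count.
phi = lambda n: n * math.log2(n)
kappa = lambda n, p: 0.5 * p * math.log2(p) if p > 1 else 0.0

n = 4096
best_p = min(range(1, 1025), key=lambda p: parallel_time(n, p, phi, kappa))
print(best_p)  # the processor count minimizing overall execution time
```

Adding processors past `best_p` makes the communication term grow faster than the computation term shrinks, so the total execution time rises again.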

Speedup Plot

[Plot: speedup vs. processors, rising and then elbowing out]

102

Cost

The cost of a parallel algorithm (or program) is

Cost = Parallel running time × #processors

Since cost is a much overused word, the term

algorithm cost is sometimes used for clarity.

The cost of a parallel algorithm should be compared

to the running time of a sequential algorithm.

Cost removes the advantage of parallelism by

charging for each additional processor.

A parallel algorithm whose cost is big-oh of the

running time of an optimal sequential algorithm is

called cost-optimal.

103

Cost Optimal

From last slide, a parallel algorithm is optimal if

parallel cost = O(f(t)),

where f(t) is the running time of an optimal

sequential algorithm.

Equivalently, a parallel algorithm for a problem is said

to be cost-optimal if its cost is proportional to the

running time of an optimal sequential algorithm for

the same problem.

By proportional, we mean that

cost = t_p × n = k × t_s

where k is a constant and n is the number of processors.

In cases where no optimal sequential algorithm is

known, then the fastest known sequential

algorithm is sometimes used instead.

104

Efficiency

Efficiency = Sequential execution time / (Processors used × Parallel execution time)

Efficiency = Speedup / Processors used

Efficiency for a problem of size n on p processors is denoted by ε(n,p) in Quinn.

Efficiency = Sequential running time / Cost

105
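The definitions combine as follows; the timings used are made-up numbers for illustration.

```python
def cost(t_p, p):
    # algorithm cost = parallel running time x number of processors
    return t_p * p

def efficiency(t_s, t_p, p):
    # efficiency = sequential time / cost = speedup / processors
    return t_s / cost(t_p, p)

t_s, t_p, p = 100.0, 20.0, 8     # illustrative timings
speedup = t_s / t_p              # 5.0
print(efficiency(t_s, t_p, p))   # 0.625, the same as speedup / p
```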

Bounds on Efficiency

Recall

(1) efficiency = speedup / processors

For algorithms for traditional problems, superlinearity is not

possible and

(2) speedup ≤ processors

Since speedup ≥ 0 and processors > 1, it follows from the

above two equations that

0 ≤ ε(n,p) ≤ 1

Algorithms for non-traditional problems also satisfy 0 ≤

ε(n,p). However, for superlinear algorithms it follows that

ε(n,p) > 1 since speedup > p.

106

Amdahl's Law

Let f be the fraction of operations in a

computation that must be performed

sequentially, where 0 ≤ f ≤ 1. The maximum

speedup S achievable by a parallel computer

with n processors is

S(n) ≤ 1 / (f + (1 − f)/n)

The word "law" is often used by computer scientists when it is an observed

phenomenon (e.g., Moore's Law) and not a theorem that has been proven in a strict

sense.

However, Amdahl's law can be proved for traditional problems.

107
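The statement translates directly into a one-line function; the sample sequential fraction f = 0.1 is an arbitrary choice for the example.

```python
def amdahl_speedup(f, n):
    # Amdahl's law bound: S(n) <= 1 / (f + (1 - f)/n)
    return 1.0 / (f + (1.0 - f) / n)

# With a 10% sequential fraction the bound approaches 1/f = 10,
# no matter how many processors are added:
for n in (1, 10, 100, 1000):
    print(n, amdahl_speedup(0.1, n))
```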

Proof for Traditional Problems: If the fraction of the

computation that cannot be divided into concurrent tasks is f,

and no overhead incurs when the computation is divided into

concurrent parts, the time to perform the computation with n

processors is given by

t_p ≥ f·t_s + [(1 − f)·t_s] / n

Proof of Amdahl's Law (cont.)

Using the preceding expression for t_p,

S(n) = t_s / t_p ≤ t_s / (f·t_s + (1 − f)·t_s / n) = 1 / (f + (1 − f)/n)

The last expression is obtained by dividing numerator and

denominator by t_s, which establishes Amdahl's law.

Multiplying numerator & denominator by n produces the

following alternate version of this formula:

S(n) ≤ n / (nf + (1 − f)) = n / (1 + (n − 1)f)

109

Amdahl's Law

Preceding proof assumes that speedup can not be

superlinear; i.e.,

S(n) = t_s / t_p ≤ n

Assumption only valid for traditional problems.

Question: Where is this assumption used?

The pictorial portion of this argument is taken from

chapter 1 of Wilkinson and Allen

Sometimes Amdahl's law is just stated as

S(n) ≤ 1/f

Note that S(n) never exceeds 1/f and approaches

1/f as n increases.

110

Consequences of Amdahl's Limitations to

Parallelism

For a long time, Amdahl's law was viewed as a fatal flaw to the

usefulness of parallelism.

Amdahl's law is valid for traditional problems and has several

useful interpretations.

Some textbooks show how Amdahl's law can be used to

increase the efficiency of parallel algorithms

See Reference (16), Jordan & Alaghband textbook

Amdahl's law shows that efforts required to further reduce

the fraction of the code that is sequential may pay off in large

performance gains.

Hardware that achieves even a small decrease in the percent

of things executed sequentially may be considerably more

efficient.

111

Limitations of Amdahl's Law

A key flaw in past arguments that Amdahl's law is

a fatal limit to the future of parallelism is

Gustafson's Law: The proportion of the computations

that are sequential normally decreases as the problem

size increases.

Note: Gustafson's law is an observed phenomenon and not a

theorem.

Other limitations in applying Amdahl's Law:

Its proof focuses on the steps in a particular algorithm,

and does not consider that other algorithms with more

parallelism may exist

Amdahl's law applies only to standard problems where

superlinearity can not occur

112
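Gustafson's observation is often quantified by the Gustafson-Barsis scaled-speedup formula, which these slides do not derive; it is included here only as a sketch. Here s is the fraction of the parallel execution time spent in serial code.

```python
def scaled_speedup(s, p):
    # Gustafson-Barsis scaled speedup: psi = p + (1 - p)*s = p - (p - 1)*s
    return p - (p - 1) * s

# For a small serial fraction the scaled speedup stays close to p,
# growing nearly linearly as processors are added:
for p in (8, 64, 512):
    print(p, scaled_speedup(0.05, p))
```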

Other Limitations of Amdahl's Law

Recall

ψ(n,p) ≤ [σ(n) + φ(n)] / [σ(n) + φ(n)/p + κ(n,p)]

Amdahl's law ignores the communication cost

κ(n,p) in MIMD systems.

This term does not occur in SIMD systems, as

communications routing steps are deterministic and

counted as part of computation cost.

On communications-intensive applications, even the

κ(n,p) term does not capture the additional

communication slowdown due to network

congestion.

As a result, Amdahl's law usually overestimates the

speedup achievable.

113

Amdahl Effect

Typically communication time κ(n,p) has

lower complexity than φ(n)/p (i.e., time for

parallel part)

As n increases, φ(n)/p dominates κ(n,p)

As n increases,

sequential portion of algorithm decreases

speedup increases

Amdahl Effect: Speedup is usually an

increasing function of the problem size.

114

Illustration of Amdahl Effect

[Plot: speedup vs. processors for n = 100, n = 1,000, and n = 10,000;

the larger the problem size, the higher the speedup curve]

115
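The effect can be checked numerically. The cost functions here (σ(n) = n, φ(n) = n², κ = 0) are hypothetical, chosen only so the bound is easy to evaluate: for a fixed p, the speedup bound rises toward p as n grows.

```python
def speedup_bound(n, p):
    # hypothetical costs: sigma(n) = n, phi(n) = n*n, kappa = 0
    sigma, phi = n, n * n
    return (sigma + phi) / (sigma + phi / p)

p = 16
bounds = [speedup_bound(n, p) for n in (100, 1_000, 10_000)]
print(bounds)  # increases with n, approaching p = 16
```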

The Isoefficiency Metric

(Terminology)

Parallel system a parallel program executing

on a parallel computer

Scalability of a parallel system - a measure of

its ability to increase performance as number

of processors increases

A scalable system maintains efficiency as

processors are added

Isoefficiency - a way to measure scalability

116

Notation Needed for the

Isoefficiency Relation

n - data size

p - number of processors

T(n,p) - execution time, using p processors

ψ(n,p) - speedup

σ(n) - inherently sequential computations

φ(n) - potentially parallel computations

κ(n,p) - communication operations

ε(n,p) - efficiency

Note: At least in some printings, there appears to be a misprint on page 170

in Quinn's textbook, with φ(n) sometimes printed as |(n). To

correct, simply replace each | with φ.

117

Isoefficiency Concepts

T₀(n,p) is the total time spent by processes

doing work not done by the sequential algorithm.

T₀(n,p) = (p − 1)σ(n) + pκ(n,p)

We want the algorithm to maintain a constant

level of efficiency as the data size n increases.

Hence, ε(n,p) is required to be a constant.

Recall that T(n,1) represents the sequential

execution time.

118

The Isoefficiency Relation

Suppose a parallel system exhibits efficiency ε(n,p). Define

C = ε(n,p) / (1 − ε(n,p))

T₀(n,p) = (p − 1)σ(n) + pκ(n,p)

In order to maintain the same level of efficiency as the number of

processors increases, n must be increased so that the following

inequality is satisfied:

T(n,1) ≥ C·T₀(n,p)

119
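A numeric check of the relation, with hypothetical cost functions (σ(n) = log n, φ(n) = n, κ(n,p) = log p) and a target efficiency of 0.8 chosen for illustration:

```python
import math

def T0(n, p, sigma, kappa):
    # total overhead: (p - 1)*sigma(n) + p*kappa(n,p)
    return (p - 1) * sigma(n) + p * kappa(n, p)

def maintains_efficiency(n, p, eps, sigma, phi, kappa):
    C = eps / (1.0 - eps)
    T_seq = sigma(n) + phi(n)            # T(n,1)
    return T_seq >= C * T0(n, p, sigma, kappa)

# Hypothetical cost functions for illustration only.
sigma = lambda n: math.log2(n)
phi = lambda n: float(n)
kappa = lambda n, p: math.log2(p)

print(maintains_efficiency(1 << 20, 64, 0.8, sigma, phi, kappa))  # large n
print(maintains_efficiency(256, 64, 0.8, sigma, phi, kappa))      # small n
```

For this system the relation holds at p = 64 once n is large enough, but fails for small n: exactly the behavior the relation is meant to capture.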

Isoefficiency Relation Derivation

(See pages 170-171 in Quinn)

MAIN STEPS:

Begin with speedup formula

Compute total amount of overhead

Assume efficiency remains constant

Determine relation between sequential

execution time and overhead

120

Deriving Isoefficiency Relation

(see Quinn, pgs 170-171)

Determine overhead:

T₀(n,p) = (p − 1)σ(n) + pκ(n,p)

Substitute overhead into speedup equation:

ψ(n,p) ≤ p(σ(n) + φ(n)) / (σ(n) + φ(n) + T₀(n,p))

Isoefficiency Relation:

T(n,1) ≥ C·T₀(n,p)

Isoefficiency Relation

121

Isoefficiency Relation Usage

Used to determine the range of processors for which

a given level of efficiency can be maintained

The way to maintain a given efficiency is to increase

the problem size when the number of processors

increases.

The maximum problem size we can solve is limited by

the amount of memory available

The memory size is a constant multiple of the

number of processors for most parallel systems

122

The Scalability Function

Suppose the isoefficiency relation reduces to n

> f(p)

Let M(n) denote memory required for

problem of size n

M(f(p))/p shows how memory usage per

processor must increase to maintain same

efficiency

We call M(f(p))/p the scalability function [i.e.,

scale(p) = M(f(p))/p]

123
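A worked example under assumed relations (not taken from the slides): suppose the isoefficiency relation reduces to n ≥ C·p·log₂(p) and the memory requirement is M(n) = n. Then memory per processor must grow like log p, so the system is scalable but not perfectly scalable.

```python
import math

def scale(p, C=10.0):
    # Assumed isoefficiency relation: n >= f(p) = C * p * log2(p)
    f = C * p * math.log2(p)
    M = f                          # assumed memory requirement M(n) = n
    return M / p                   # memory needed per processor

print(scale(8), scale(64), scale(1024))  # grows like C * log2(p)
```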

Meaning of Scalability Function

To maintain efficiency when increasing p, we must

increase n

Maximum problem size is limited by available

memory, which increases linearly with p

Scalability function shows how memory usage per

processor must grow to maintain efficiency

If the scalability function is a constant this means the

parallel system is perfectly scalable

124

Interpreting Scalability Function

[Plot: memory needed per processor vs. number of processors, showing

scalability functions C, C log p, C p, and C p log p. The available

memory per processor is fixed, so efficiency can be maintained for

curves at or below the memory-size line and cannot be maintained for

curves above it.]

CSE 160/Berman

Odd-Even Transposition Sort

Parallel version of bubblesort: many compare-exchanges done simultaneously

Algorithm consists of Odd Phases and Even

Phases

In even phase, even-numbered processes exchange

numbers (via messages) with their right neighbor

In odd phase, odd-numbered processes exchange

numbers (via messages) with their right neighbor

Algorithm alternates odd phase and even phase

for O(n) iterations
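A sequential Python sketch of the algorithm, with one array element standing in for each process's value; the compare-exchanges inside a single phase are independent, which is what a parallel machine exploits.

```python
def odd_even_transposition_sort(a):
    a = list(a)
    n = len(a)
    for phase in range(n):                  # O(n) phases suffice
        start = phase % 2                   # alternate even/odd phases
        for i in range(start, n - 1, 2):    # independent compare-exchanges
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([3, 10, 4, 8, 1]))  # [1, 3, 4, 8, 10]
```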


Odd-Even Transposition Sort

Data Movement

General Pattern for n=5

[Diagram: processes P0-P4 compare-exchanging with alternating

neighbors at times T=1 through T=5]


Odd-Even Transposition Sort

Example

General Pattern for n=5

       P0  P1  P2  P3  P4

T=0:    3  10   4   8   1

T=1:    3  10   4   8   1

T=2:    3   4  10   1   8

T=3:    3   4   1  10   8

T=4:    3   1   4   8  10

T=5:    1   3   4   8  10


Odd-Even Transposition Code

Compare-exchange accomplished through message passing

Odd Phase

Even Phase

P_i = 0, 2,4,,n-2

recv(&A, P_i+1);

send(&B, P_i+1);

if (A<B) B=A;

P_i = 1,3,5,,n-1

send(&A, P_i-1);

recv(&B, P_i-1);

if (A<B) A=B;

P_i = 2,4,6,,n-2

send(&A, P_i-1);

recv(&B, P_i-1);

if (A<B) A=B;

P_i = 1,3,5,,n-3

recv(&A, P_i+1);

send(&B, P_i+1);

if (A<B) B=A;

P0 P1 P2 P3 P4

P0 P1 P2 P3 P4

9/06/99

Sorting on the CRCW and

CREW PRAMs

Sort on the CRCW PRAM

Similar idea used to design MIN-CRCW

Needs a more powerful (less realistic) model of resolving concurrent

writes that sums the values to be concurrently written

Uses n(n-1)/2 processors and runs in constant time

Drawback: many processors and a powerful model

Sort on the CREW PRAM

uses an auxiliary two-dimensional array Win[1:n,1:n]

separate write and separate sum in O(1) and O(log n) using O(n^2)

processors


Odd-Even Merge Sort on

the EREW PRAM

Sorting on the EREW PRAM model: uses too many processors

Merge Sort on the EREW PRAM

Based on the idea of divide-and-conquer

Idea: divide a list into two sub-lists, recursively sort and

merge

Uses n processors with complexity W(n) = O(n)

S(n) = O(log n), C(n) = O(n^2)

Odd-Even Merge Sort on the EREW PRAM

Speeds up the merge process

Make an odd-index list and an even-index list, and recursively merge

Lastly do an even-odd compare-exchange step


Odd-Even Merge Sort on the

EREW PRAM : contd

Odd-Even Merge Sort on the EREW PRAM

number of processors used: n

W(n) = 1 + 2 + … + log n = O((log n)^2)

C(n) = O(n (log n)^2)


Sorting on the One-

dimensional Mesh

Any comparison-based parallel sorting algorithm on M_p

must perform at least n-1 communication steps to

properly decide between the relative order of the

elements in P1 and Pn.

Can achieve a speedup of at most log n on the one-

dimensional mesh

Two algorithms

Insertion Sort

Odd-Even Transposition Sort


Sorting on the Two-

dimensional Mesh

Order in the two-dimensional mesh

Row, column-major orders

Snake order

Snake-order sorting

Repetition of row and column sort

When sorting columns, the direction is the snake-order direction

W(n) = (⌈log n⌉ + 1)√n

C(n) = n·W(n)

S(n) ≈ √n


Bitonic MergeSort

EREW PRAM

What is Bitonic List?

A sequence of numbers, x1, …, xn, with the property that

(1) there exists an index i such that x1 < x2 < … < xi and xi > xi+1 > … > xn, or else

(2) there exists a cyclic shift of indices so that condition (1) holds

What is rearrangement?

For X = (x1, …, xn), X′ = (x′1, …, x′n) is defined as

x′i = min(xi, xi+n/2), x′i+n/2 = max(xi, xi+n/2)

Property

Let A and B be the sub-lists of X′ after the rearrangement

Then A and B are bitonic lists.

And any element in B is larger than all elements in A.


Bitonic MergeSort EREW

PRAM : contd

Bitonic Sort

Input : A bitonic list

Output : A sorted list

Algorithm :

Recursive rearrangement

Bitonic Merge

Merge of two increasing-order lists as one sorted list

Change the index for rearrangement: (i + n/2) => (n - i + 1)

the results are two bitonic sub-lists

Call Bitonic Sort for two different lists

Bitonic MergeSort Algorithm

Input : a random list

Output : a sorted list

Recursive call of Bitonic MergeSort and following call of

Bitonic Merge


Bitonic MergeSort EREW

PRAM : contd

Bitonic MergeSort Complexity

W(n) = O((log n)^2)

C(n) = O(n (log n)^2)

S(n) = O(n / log n)

Bitonic Merge Sorting Network

See the figure
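A recursive Python sketch of bitonic mergesort (it assumes the list length is a power of two). The pairwise min/max operations inside each rearrangement step are independent, which is what yields the O((log n)^2) parallel time with n processors.

```python
def bitonic_merge(a, ascending=True):
    # a must be a bitonic list; returns it fully sorted.
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    for i in range(half):                 # the rearrangement step
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return (bitonic_merge(a[:half], ascending) +
            bitonic_merge(a[half:], ascending))

def bitonic_mergesort(a, ascending=True):
    n = len(a)
    if n == 1:
        return list(a)
    half = n // 2
    left = bitonic_mergesort(a[:half], True)    # increasing sub-list
    right = bitonic_mergesort(a[half:], False)  # decreasing sub-list
    return bitonic_merge(left + right, ascending)  # concatenation is bitonic

print(bitonic_mergesort([6, 3, 8, 1, 7, 2, 5, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```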


