
History of Parallel Computing. It begins in 1955 with Gene Amdahl in the United States, working at the IBM company.

What is parallel processing? It is the division of work into small tasks: assigning several small tasks to multiple workers so they proceed simultaneously. Parallel processing is the use of multiple processors to execute different parts of the same program simultaneously.

Difficulties: coordinating, controlling, and monitoring the workers. The main goals of parallel processing are: to solve larger problems faster; to reduce the execution time of computer programs; to increase the size of the computational problems that can be solved. Today there are several lines of research within parallel computing, as well as a variety of machines, languages, and applications.

Advantages. The idea of parallel computing arises from the following needs: to perform operations much faster; to process large volumes of information; and to obtain results in the least possible time. These needs are both the motivation for and the main goals of parallel computing. Definition. Parallel computer: a collection of processors interconnected in some way within a single cabinet so that they can exchange information.

Taxonomy (from the Greek taxis, "arrangement," and nomos, "norm" or "rule") means the science of classification. Computers should be classified according to their most salient characteristics, not the detailed ones that appear in data sheets. Several taxonomies or classifications of computers exist, such as Skillicorn's, Shore's (6 types), Handler's, and the structural taxonomy of Hockney and Jesshope. The most important of these classifications is Flynn's.

Flynn's taxonomy classifies computers (their architecture and operating systems) according to the system's capacity to process: one or more simultaneous data streams; one or more simultaneous instruction streams.

SISD: a single instruction stream operates on a single data stream (classic architecture, superscalars).

SIMD: a single instruction stream operates on multiple data streams (array computers). MISD: multiple instruction streams operate on a single data stream (a class not implemented, an artifact of the classification). MIMD: multiple instruction streams operate on multiple data streams (multiprocessors).

Paradigms of parallel computing. Parallel software paradigms. There are several methods of programming parallel computers. The two most common are message passing and data parallelism. Message passing: the user makes explicit library calls to share information between processors. Data parallel: the partitioning of the data determines the parallelism. Shared memory: multiple processors share a common memory space.

Remote memory operation: a set of processors in which one process can access the memory of another process without its participation. Threads: a single process has multiple (and concurrent) execution paths. Combined models: composed of two or more of the models above. Note: these models are machine/architecture independent; any of them can be implemented on any hardware with appropriate operating-system support. An effective implementation is one that stays close to the hardware model and gives the user ease of programming.

Message Passing. The message passing model is defined as: a set of processes using only local memory; processes communicate by sending and receiving messages; data transfer requires cooperative operations performed by each process (a send operation must have a matching receive). Programming with message passing is done by linking with, and making calls to, libraries that manage the data exchange between processors. Message passing libraries are available for most modern programming languages.
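To make this concrete, here is a minimal sketch (ours, not from the slides) of a matching send/receive pair using the MPI library, which is discussed later in these notes:

```c
/* Minimal MPI send/receive sketch (illustrative; assumes an MPI
   implementation such as MPICH or Open MPI is installed).
   Compile: mpicc send_recv.c -o send_recv
   Run:     mpirun -np 2 ./send_recv */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        /* This send must have a matching receive on rank 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }
    MPI_Finalize();
    return 0;
}
```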

Data Parallel. The data parallel model is defined as: each process works on a different part of the same data structure; commonly a Single Program Multiple Data (SPMD) approach; data is distributed across processors; all message passing is done invisibly to the programmer; commonly built "on top of" one of the common message passing libraries. Programming in the data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler.
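As an illustration of the SPMD idea (a hypothetical sketch; the helper name block_range is ours), each of several identical processes can compute which block of an n-element array it owns purely from its own rank:

```c
#include <stdio.h>

/* Each of nprocs identical SPMD processes computes the index range
   [lo, hi) of the n-element array it owns, based on its rank.
   Remainder elements are spread over the first (n % nprocs) ranks. */
static void block_range(int rank, int nprocs, int n, int *lo, int *hi) {
    int base = n / nprocs, rem = n % nprocs;
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

int main(void) {
    int lo, hi;
    /* Example: how 10 elements are split among 4 processes. */
    for (int rank = 0; rank < 4; rank++) {
        block_range(rank, 4, 10, &lo, &hi);
        printf("rank %d owns [%d, %d)\n", rank, lo, hi);
    }
    return 0;
}
```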

Classification of Parallelism. Temporal (pipeline): a program executes sequentially and, at a certain point, the tasks are divided among several processing units; when execution finishes in each unit, sequential execution resumes. Spatial: spatial parallelism occurs when there are several processors and a process can run on each of them more or less independently; in the optimal case, the execution time is divided by the number of processors at work. Independent: this parallelism does not depend on the topology of the processor network, since the program does not adapt its structure to that of the network's connections; in other words, no matter how the processors are connected, the parallel execution of the program takes place.

Levels of parallelism. There are two key qualities for parallel programming. Granularity: the relative size of the unit of computation that executes in parallel; this unit can be a statement, a function, or an entire process. Communication channel: the basic mechanism by which the independent units of the program exchange data and synchronize their activity. Statement level: the finest level of granularity; used in languages such as Power C, Fortran 77/90, and Power Fortran 77/90; shared variables are used within a single memory system.

Thread level. A thread is an independent execution state within the context of a larger program, that is: a set of machine registers; a call stack; and the ability to execute code. A program can create several threads to execute in the same address space. The reasons for using threads are portability and performance. The resources of a single processor are shared. Process level. A process in UNIX consists of: an address space; a large number of process state values; and a thread of execution.

The mechanism for interprocess communication can be used to exchange data and to coordinate the activities of multiple asynchronous processes. A process can create one or more processes; the process that creates another is called the parent process, and the one created is called the child process. The initial process is called the root.

Performance. The performance of a parallel program is measured in terms of three quantities:

Speedup: S = T1 / Tp

Efficiency: E = S / P = T1 / (P * Tp)

Cost: C = P * Tp

Where: T = time in seconds; T1 = time using one processing unit; Tp = time using two or more processing units; P = number of processing units.
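A small worked example of these three measures (the timing values are illustrative, not from the slides):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical measurements: T1 on one processor, Tp on P processors. */
    double t1 = 100.0, tp = 30.0;
    int p = 4;

    double speedup    = t1 / tp;         /* S = T1 / Tp */
    double efficiency = speedup / p;     /* E = S / P   */
    double cost       = p * tp;          /* C = P * Tp  */

    printf("S = %.2f, E = %.2f, C = %.1f processor-seconds\n",
           speedup, efficiency, cost);   /* S = 3.33, E = 0.83, C = 120.0 */
    return 0;
}
```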

Amdahl's Law. Amdahl's law concerns the speedup obtained by using processors in parallel on a problem, compared with using a single serial processor. To understand speedup, first consider speed: the speed of a program is the time it takes to execute, which can be measured in any increment of time. Speedup is defined as the time a program takes to execute serially (with one processor) divided by the time it takes to execute in parallel (with several processors). In its usual statement, if a fraction f of a program can be parallelized, Amdahl's law bounds the speedup on P processors by S(P) = 1 / ((1 - f) + f/P).
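A short sketch of this bound with illustrative values (f and the processor counts are our choices, not from the slides):

```c
#include <stdio.h>

/* Amdahl's law: speedup bound for a program whose fraction f
   is parallelizable, run on p processors. */
static double amdahl(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    /* Example: 90% parallelizable code. No matter how many processors
       are used, the speedup stays below 1 / (1 - 0.9) = 10. */
    for (int p = 2; p <= 1024; p *= 4)
        printf("P = %4d  ->  speedup = %.2f\n", p, amdahl(0.9, p));
    return 0;
}
```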

Parallel Architectures. Old criteria: parallel architectures tied to programming models; divergent architectures, with no growth pattern.

Parallel Architectures. Current criteria: extension of computer architecture to support communication and cooperation. BEFORE: the instruction set. NOW: communications.

We must define: abstractions, boundaries, primitives (interfaces); and the structures that implement those interfaces (hardware or software). Compilers, libraries, and the OS are important concerns nowadays.

Parallel Architectures. Recall that a parallel computer is a collection of processing elements that communicate and cooperate to solve large problems quickly. We can therefore say: Parallel Architecture = Conventional Architecture + Communication Architecture.

Parallel Architectures. Communication architecture = User/System Interface + Implementation. User/System Interface: communication primitives at user level and at system level. Implementation: the structures that implement the primitives (hardware or OS); optimization capabilities; integration into the processing nodes; network structure. Goals: performance; broad applicability; ease of programming; scalability; low cost.

Parallel Architectures. Programming models. A programming model specifies communications and synchronization. Examples: Multiprogramming: no communication or synchronization; program-level parallelism. Shared memory: like a bulletin board. Message passing: like letters or phone calls, point to point. Data parallelism: several agents act on individual data items and then exchange control information before continuing; the exchange is implemented with shared memory or with message passing.

Parallel Architectures. Levels of abstraction in communication.

Parallel Architectures. Evolution of the architectural models: the programming model, the communication model, and the machine organization together make up the architecture. Shared memory space. Message passing. Data parallelism. Others: dataflow; systolic arrays.

Shared Memory. Any processor can directly reference any memory location. Communication happens implicitly through loads and stores. Advantages: transparent location; programming similar to time-sharing on uniprocessors, except that processes run on different processors; good load-balancing performance.

Shared Memory. Provided on a wide range of platforms. Historically, its precursors date back to the 1960s. From 2 processors up to hundreds of processors. Known as shared-memory machines. Ambiguity: the memory may be distributed among the processors.

A process is a virtual address space plus one or more threads. Part of the address space is shared by several processes. Writes to shared locations are visible to the other threads (including those in other processes). It is the natural extension of the uniprocessor model: conventional memory, plus special atomic operations for synchronization. The operating system uses shared memory to coordinate processes.


Shared Memory. Communication hardware:

Independent growth of memory, I/O, or processing capacity by adding modules, controllers, or processors.

Shared Memory. Mainframe communication strategy: crossbar network. Initially limited by processor cost; later, by network cost. Bandwidth grows with p. High expansion cost; multistage networks are used instead.

Shared Memory. Minicomputer communication strategy: almost all microprocessor-based systems use a bus. Widely used for parallel computing. Called SMPs, symmetric multiprocessors. The bus can become a bottleneck. Cache coherence problem. Low expansion cost.

Shared Memory. Example: Intel Pentium Pro Quad. Coherence and multiprocessing are integrated into the processor module.

Shared Memory. Example: SUN Enterprise. 16 cards of any type: processors + memory, or I/O. Memory access is over the bus, and symmetric.

Shared Memory. Other communication options. Interconnection problems: cost (crossbars) or bandwidth (bus). Dance-hall: scalable at lower cost than crossbars; uniform memory-access latency, but high. NUMA (non-uniform memory access): construction of a single memory space with different latencies. COMA (Cache Only Memory Architecture): a memory architecture based on shared caches.

Shared Memory. Example: Cray T3E. Scalable to 1024 processors, with 480 MB/s links. The memory controller generates the requests for non-local locations. There is no hardware mechanism for coherence (the SGI Origin and others do provide one).

Message Passing. Built from complete computers, including I/O. Communication through explicit I/O operations. Programming model: direct access only to private addresses (local memory); communication through messages (send/receive). Block diagram similar to NUMA, but communication is integrated at the I/O level. Like networks of workstations (clusters), but with tighter integration. Easier to build and scale than NUMA systems. The programming model is less integrated into the hardware: libraries or operating-system intervention.

Message Passing. send specifies the buffer to transmit and the receiving process. recv specifies the sending process and the buffer to store into. They are memory-to-memory copies, but process names are needed. Optionally, the destination can be included in the send, along with identification rules at the destination. In the simplest form, pairing is achieved through the synchronization of send/recv events. Many synchronization variants exist. Large overheads: copying, buffer management, protection.


Message Passing. Evolution of message-passing machines. Early machines: a FIFO on each link. Programming model very close to the hardware; simple synchronization operations. Replaced by DMA, which allows non-blocking operations, with buffering at the destination until recv. Decreasing influence of topology (hardware routing). Store-and-forward routing: topology matters. Introduction of multistage networks. Higher cost: node-to-network communication. Simpler programming.

Message Passing. Example: IBM SP-2. Built from RS6000 workstations.

Message Passing. Example: Intel Paragon.

The convergence of architectures. The evolution and role of software have blurred the boundary between shared memory and message passing. send/recv supports shared memory via buffers. A global address space can be built on top of message passing. Hardware organization is also converging: tighter integration for message passing (lower latency, higher bandwidth); at a low level, some shared-memory systems implement message passing in hardware. Different programming models, but converging as well.

Interconnection of parallel systems. The mission of the network in a parallel architecture is to transfer information from any source to any destination, minimizing latency and at a proportionate cost. The network is composed of: nodes; switches; links. The network is characterized by its: topology: the structure of the physical interconnection; routing: which determines the routes messages may or must follow in the network graph; switching strategy: circuit switching or packet switching; flow control: mechanisms for organizing traffic.

Interconnection of parallel systems. Classification of networks by topology. Static: direct, static point-to-point connections between nodes; tight coupling between network interface and node; the vertices of the network graph are nodes or switches. They are further classified as: symmetric: ring, hypercube, torus; asymmetric: bus, tree, mesh.

Interconnection of parallel systems. Classification of networks by topology. Dynamic: the switches can dynamically change which nodes they interconnect. They are further classified as: single-stage; multistage: blocking (baseline, butterfly, shuffle), reconfigurable (Beneš), non-blocking (Clos).

Interconnection of parallel systems. Characteristic parameters of a network. Network size: the number of nodes that make up the network. Node degree: the number of links incident on the node. Network diameter: the longest minimal path that can be found between any two nodes of the network. Symmetry: a network is symmetric if all nodes are indistinguishable from the point of view of communication.


Static networks. 3-D cycle-connected hypercube (cube-connected cycles).

Static networks. Example of connections in a 3-dimensional hypercube:


Connection of nodes that differ in the least significant bit.

Connection of nodes that differ in the second bit.

Connection of nodes that differ in the most significant bit.

Dynamic networks. Dynamic networks are networks whose configuration can be modified. There are two types: single-stage and multistage. Single-stage networks make connections between processing elements in a single stage. It may not be possible to reach every element from every other, so it may be necessary to recirculate the information (=> recirculating networks). Multistage networks make connections between the processing elements in more than one stage.

Dynamic networks. Single-stage interconnection networks.

Dynamic networks. Crossbar network: allows any connection.

Dynamic networks. Multistage interconnection networks. Switching boxes.

The four possible configurations of a 2-input switching box.

Blocking dynamic networks. Blocking multistage networks are characterized by the fact that it is not always possible to establish a new connection between a free source/destination pair, owing to conflicts with the connections in progress. Generally there is a single possible path between each source/destination pair.

Blocking dynamic networks. Baseline network:

Blocking dynamic networks. Butterfly network:

Blocking dynamic networks. Perfect-shuffle network:

Reconfigurable dynamic networks. Reconfigurable multistage networks are characterized by the fact that it is always possible to establish a new connection between a free source/destination pair, even when other connections are in progress, but it may become necessary to change the path used by some of them (reconfiguration). This is interesting in array processors, where all interconnection requests are known simultaneously.

Reconfigurable dynamic networks. Beneš network:

Reconfigurable dynamic networks. The Beneš network can be built recursively:

Non-blocking dynamic networks. Non-blocking dynamic networks are characterized by the fact that it is always possible to establish a new connection between a free source/destination pair, without restrictions. They are analogous to crossbar switches, but may exhibit higher latency because of the multiple stages.

Non-blocking dynamic networks. Clos network:

Cache Coherence. Common memory-hierarchy structures in multiprocessors: shared cache memory; memory shared over a bus; interconnection through a network (dance-hall); distributed memory.

Cache Coherence. Shared cache. Small number of processors (2-8). It was common in the mid-1980s for connecting a pair of processors on a board. A possible strategy in chip multiprocessors.

Cache Coherence. Sharing over a bus. Widely used in small- and medium-scale multiprocessors (20-30 processors). The dominant form among current parallel machines. Modern microprocessors are equipped to support coherence protocols in this configuration.

Cache Coherence. Dance hall. Easily scalable. Symmetric UMA structure. Memory is too far away, especially in large systems.

Cache Coherence. Distributed memory. Especially attractive for scalable multiprocessors. Non-symmetric NUMA structure. Fast local accesses.

Parallel computer architectures. Types of processor organization. There are 7 important processor organizations for connecting the processors in a parallel computer: mesh networks; binary tree networks; hypertree networks; pyramid networks; butterfly networks; hypercube networks; cube-connected cycles networks.

Processor organization is also known as topology, and there are criteria for evaluating which model is more suitable than another; the applications can also determine whether a model is suitable or not. Criteria for evaluating models: diameter; bisection width; number of edges per node; maximum edge length; degree of an architecture.

Mesh Networks. Nodes are arranged into a q-dimensional lattice. Communication is allowed only between neighboring nodes. Two-dimensional meshes: mesh with no wrap-around connections; mesh with wrap-around connections between processors in the same row or column; mesh with wrap-around connections between processors in adjacent rows or columns. Wrap-around connections can connect processors in the same row or column, or in adjacent rows or columns.

Evaluation of the Mesh: Interior nodes communicate with 2q other processors. The diameter of a q-dimensional mesh with k^q nodes is q(k - 1). The bisection width of a q-dimensional mesh with k^q nodes is k^(q-1). The maximum number of edges per node is 2q. The maximum edge length is a constant, independent of the number of nodes, for two- and three-dimensional meshes. The two-dimensional mesh has been a popular topology for processor arrays: Goodyear Aerospace's MPP; AMT DAP; MasPar's MP; the Intel Paragon XP/S multicomputer connects processors with a two-dimensional mesh.
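A quick numeric check of these formulas (example values are ours):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Example: a 2-dimensional mesh (q = 2) with k = 4 nodes per
       side, i.e. k^q = 16 nodes in total. */
    int q = 2, k = 4;
    int nodes     = (int)pow(k, q);      /* k^q            */
    int diameter  = q * (k - 1);         /* q(k - 1)       */
    int bisection = (int)pow(k, q - 1);  /* k^(q-1)        */
    int max_edges = 2 * q;               /* edges per node */

    printf("nodes=%d diameter=%d bisection=%d edges/node=%d\n",
           nodes, diameter, bisection, max_edges);  /* 16 6 4 4 */
    return 0;
}
```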

Binary Tree Networks. 2^k - 1 nodes are arranged into a complete binary tree of depth k-1. A node has at most three links. Every interior node can communicate with its two children, and every node other than the root can communicate with its parent. The binary tree has a low diameter, 2(k - 1), but a poor bisection width of one. It is impossible to arrange the nodes of a binary tree in three-dimensional space such that, as the number of nodes increases, the length of the longest edge is always less than a specified constant.

Hypertree Networks An approach to building a network with the low diameter of a binary tree with an improved bisection width. The easiest way to think of a hypertree network of degree k and depth d is to consider the network from two different angles. From the front a hypertree network of degree k and depth d looks like a complete k-ary tree of height d. From the side, the same hypertree network looks like an upside down binary tree of height d. Joining the front and side views yields the complete network.

Hypertree evaluation. A 4-ary hypertree with depth d has 4^d leaves and 2^d (2^(d+1) - 1) nodes. Its diameter is 2d. Its bisection width is 2^(d+1). The number of edges per node is never more than six. The maximum edge length is an increasing function of the problem size.

Hypertree network of degree 4 and depth 2: (a) front view; (b) side view; (c) complete network. The Connection Machine CM-5 multicomputer is a 4-ary hypertree.

Pyramid Networks. An attempt to combine the advantages of mesh networks and tree networks. A pyramid network of size k^2 is a complete 4-ary rooted tree of height log_2 k, augmented with additional interprocessor links: the processors in every tree level form a 2-D mesh network. A pyramid of size k^2 has at its base a 2-D mesh network containing k^2 processors. The total number of processors in a pyramid of size k^2 is (4/3)k^2 - (1/3). The levels of the pyramid are numbered in ascending order: the base has level number 0, and the single processor at the apex of the pyramid has level number log_2 k.

Pyramid network of size 16. Every interior processor is connected to nine other processors: one parent, four mesh neighbors, four children.

Pyramid evaluation. The advantage of the pyramid over the 2-D mesh is that the pyramid reduces the diameter of the network: when a message must travel from one side of the mesh to the other, fewer link traversals are required if the message travels up and down the tree rather than across the mesh. The diameter of a pyramid of size k^2 is 2 log_2 k. The addition of tree links does not give a significantly higher bisection width than a 2-D mesh: the bisection width of a pyramid of size k^2 is 2k. The maximum number of links per node is no greater than nine, regardless of the size of the network. Unlike a 2-D mesh, the length of the longest edge is an increasing function of the network size.

Butterfly Network. Consists of (k+1)2^k nodes divided into k+1 rows, or ranks, each containing n = 2^k nodes. The ranks are labeled 0 through k. Ranks 0 and k are sometimes combined, giving each node four connections to other nodes.

Node connection. Let node(i, j) refer to the jth node on the ith rank, where 0 <= i <= k and 0 <= j < n. Node(i, j) on rank i > 0 is connected to two nodes on rank i-1: node(i-1, j) and node(i-1, m), where m is the integer found by inverting the ith most significant bit in the binary representation of j. If node(i, j) is connected to node(i-1, m), then node(i, m) is connected to node(i-1, j).
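A minimal sketch of this connection rule (the helper name is ours; ranks and columns as defined above):

```c
#include <stdio.h>

/* Rank-(i-1) "wing" neighbor of node(i, j) in a butterfly with
   n = 2^k columns: invert the i-th most significant bit of j
   (i = 1 flips the most significant of the k bits). */
static unsigned butterfly_neighbor(unsigned i, unsigned j, unsigned k) {
    return j ^ (1u << (k - i));
}

int main(void) {
    /* Example: k = 3 (8 columns). node(1, 3) connects down to
       node(0, 3 ^ 4) = node(0, 7). */
    printf("m = %u\n", butterfly_neighbor(1, 3, 3)); /* prints 7 */
    return 0;
}
```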

Butterfly evaluation. As the rank numbers decrease, the widths of the wings of the butterflies increase exponentially. The length of the longest network edge increases as the number of network nodes increases. The diameter of a butterfly network with (k + 1)2^k nodes is 2k. The bisection width is 2^(k-1). A butterfly network serves to route data from non-local memory to processors on the BBN TC2000 multiprocessor.

Hypercube. A cube-connected network, also called a binary n-cube network, is a butterfly with its columns collapsed into single nodes. It consists of 2^k nodes forming a k-dimensional hypercube. The nodes are labeled 0, 1, ..., 2^k - 1; two nodes are adjacent if their labels differ in exactly one bit position.
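The adjacency test reduces to a single bit trick; a minimal sketch (ours):

```c
#include <stdio.h>

/* Two hypercube nodes are adjacent iff their labels differ
   in exactly one bit position. */
static int adjacent(unsigned a, unsigned b) {
    unsigned d = a ^ b;
    return d != 0 && (d & (d - 1)) == 0;  /* d is a power of two */
}

int main(void) {
    printf("%d\n", adjacent(5, 7));  /* 101 vs 111 -> 1 (adjacent)        */
    printf("%d\n", adjacent(5, 6));  /* 101 vs 110 -> 0 (differ in 2 bits) */
    return 0;
}
```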

A four-dimensional hypercube.

Hypercube evaluation. The diameter of a hypercube with 2^k nodes is k. The bisection width of a network of that size is 2^(k-1). The hypercube organization has low diameter and high bisection width, at the expense of the number of edges per node and the length of the longest edge. The number of edges per node is k, the logarithm of the number of nodes in the network. The length of the longest edge in a hypercube network increases as the number of nodes in the network increases.

Cube-Connected Cycles Networks. The cube-connected cycles network is a k-dimensional hypercube whose 2^k "vertices" are actually cycles of k nodes. For each dimension, every cycle has a node connected to a node in the neighboring cycle in that dimension.

24-node cube-connected cycles network.

Cube-connected cycles evaluation. Node(i, j) is connected to node(i, m) if and only if m is the result of inverting the ith most significant bit of the binary representation of j. Compared to the hypercube, the cube-connected cycles organization has the advantage that the number of edges per node is three, a constant independent of network size.

Disadvantages: the diameter is twice that of a hypercube, and the bisection width is lower. Given a cube-connected cycles network of size k2^k, its diameter is 2k and its bisection width is 2^(k-1).

CHARACTERISTICS OF VARIOUS PROCESSOR ORGANIZATIONS

Dataflow and implicit parallelism. Von Neumann vs. parallel.


Parallel random access machine (PRAM). The PRAM (pronounced "pea ram") model of parallel computation provides a mental break from the von Neumann model and sequential algorithms. It allows parallel-algorithm designers to treat processing power as an unlimited resource. It is unrealistically simple and ignores the complexity of interprocessor communication, so the designer of PRAM algorithms can focus on the parallelism inherent in a particular computation. Cost-optimal PRAM solutions exist, meaning that the total number of operations performed by the PRAM algorithm is in the same complexity class as an optimal sequential algorithm. Cost-optimal PRAM algorithms can serve as a foundation for efficient algorithms on real parallel computers.

RAM and PRAM computational models. RAM, a model of serial computation. The random access machine (RAM) is a model of a one-address computer. It consists of: memory; a read-only input tape; a write-only output tape; a program.

RAM program

The program is not stored in memory, so it cannot be modified. The input tape contains a sequence of integers. Every time an input value is read, the input head advances one square; the output head advances after every write. Memory consists of an unbounded sequence of registers r0, r1, r2, .... Each register can hold a single integer. Register r0 is the accumulator, where computations are performed. The exact instructions are not important, as long as they resemble the instructions found on an actual computer: load, store, read, write, add, subtract, multiply, divide, test, jump, and halt.

RAM time complexity. The worst-case time complexity of a RAM program is the function f(n) giving the maximum time taken by the program to execute over all inputs of size n. The expected time complexity of a RAM program is the average, over all inputs of size n, of the execution times. Analogous definitions hold for worst-case space complexity and expected space complexity. There are two ways of measuring time and space on the RAM model: the uniform cost criterion and the logarithmic cost criterion.

Cost criteria. The uniform cost criterion says that each RAM instruction requires one time unit to execute and that every register requires one unit of space. The logarithmic cost criterion takes into account that an actual word of memory has a limited storage capacity. The uniform cost criterion is appropriate if the values manipulated by the program always fit into one computer word.

The PRAM model of parallel computation. A PRAM consists of: a control unit; global memory; and an unbounded set of processors, each with its own private memory. Active processors execute identical instructions. Every processor has a unique index, and the value of a processor's index can be used to enable or disable the processor, or to influence which memory locations it accesses.

PRAM computation. A PRAM computation begins with the input stored in global memory and a single active processing element. During each step, an active, enabled processor may read a value from a single private or global memory location, perform a single RAM operation, and write into one local or global memory location; it may also activate another processor. The processors are synchronized: all active, enabled processors must execute the same instruction, on different memory locations. The computation terminates when the last processor halts.

PRAM models differ in how they handle read or write conflicts, i.e., when two or more processors attempt to read from, or write to, the same global memory location. 1. EREW (Exclusive Read Exclusive Write): read and write conflicts are not allowed. 2. CREW (Concurrent Read Exclusive Write): concurrent reading is allowed, i.e., multiple processors may read from the same global memory location during the same instruction step; write conflicts are not allowed. (This is the default PRAM model.) 3. CRCW (Concurrent Read Concurrent Write): concurrent reading and concurrent writing are allowed; a variety of CRCW models exist, with different policies for handling concurrent writes to the same global address.

Types of CRCW PRAM. There are three different models: Common, Arbitrary, and Priority. 1. Common: all processors concurrently writing into the same global address must write the same value. 2. Arbitrary: if multiple processors concurrently write to the same global address, one of the competing processors is arbitrarily chosen as the "winner," and its value is written into the register. 3. Priority: if multiple processors concurrently write to the same global address, the processor with the lowest index succeeds in writing its value into the memory location.

Strengths of the PRAM models. The EREW PRAM model is the weakest. Clearly a CREW PRAM can execute any EREW PRAM algorithm in the same amount of time: the concurrent-read facility is simply not used. Likewise, a CRCW PRAM can execute any CREW PRAM algorithm in the same amount of time. The PRIORITY PRAM model is the strongest. Any algorithm designed for the COMMON PRAM model will execute with the same complexity on the ARBITRARY PRAM and PRIORITY PRAM models.

Strengths of the PRAM models (continued). If all processors writing to the same location write the same value, choosing an arbitrary processor would cause the same result; and if an algorithm executes correctly when an arbitrary processor is chosen as the "winner," the processor with the lowest index is as reasonable an alternative as any other. Hence any algorithm designed for the ARBITRARY PRAM model will execute with the same time complexity on the PRIORITY PRAM model. Because the PRIORITY PRAM model is stronger than the EREW PRAM model, an algorithm that solves a problem on the EREW PRAM can have higher time complexity than an algorithm solving the same problem on the PRIORITY PRAM model.

Increase in parallel time complexity. An increase in parallel time complexity can occur when moving from the PRIORITY PRAM model to the EREW PRAM model. Lemma: a p-processor EREW PRAM can sort a p-element array stored in global memory in Θ(log p) time. Theorem: a p-processor PRIORITY PRAM can be simulated by a p-processor EREW PRAM with the time complexity increased by a factor of Θ(log p).

Simulation of a PRIORITY PRAM by an EREW PRAM. Assume the PRIORITY PRAM algorithm uses processors P1, P2, ..., Pp and global memory locations M1, M2, ..., Mm. The EREW PRAM uses auxiliary global memory locations T1, T2, ..., Tp and S1, S2, ..., Sp to simulate each read or write step of the PRIORITY PRAM. When processor Pi in the PRIORITY PRAM algorithm accesses memory location Mj, processor Pi in the EREW PRAM algorithm writes the ordered pair (j, i) in memory location Ti.

Then the EREW PRAM sorts the elements of T; this step takes Θ(log p) time (by the Lemma). By reading adjacent entries in the sorted array, the highest-priority processor accessing any particular location can be found in constant time. Processor P1 reads memory location T1, retrieves the ordered pair (j1, i1), and writes a 1 into global memory location Sj1. The remaining processors Pk, where 2 <= k <= p, first read memory location Tk and then read memory location Tk-1. If jk ≠ jk-1, then processor Pk writes a 1 into Sjk; otherwise, processor Pk writes a 0 into Sjk. At this point the elements of S with value 1 correspond to the highest-priority processors accessing each memory location.

A concurrent write operation. A concurrent write operation, which takes constant time on a p-processor PRIORITY PRAM, can be simulated in Θ(log p) time on a p-processor EREW PRAM. (a) Concurrent write on the PRIORITY PRAM model: processors P1, P2, and P4 attempt to write values to memory location M3; processor P1 wins. Processors P3 and P5 attempt to write values to memory location M7; processor P3 wins.

Concurrent write on the EREW PRAM model. Each processor writes an (address, processor number) pair to a unique element of T. The processors sort T in Θ(log p) time. In constant time, processors can set to 1 those elements of S corresponding to the winning processors. The winning processors then write their values. For a write instruction, the highest-priority processor accessing each memory location writes its value. For a read instruction, the highest-priority processor accessing each memory location reads that location's value, then duplicates the value in Θ(log p) time, so that there is a copy in a unique memory location for every processor to access.

PRAM ALGORITHMS. If a PRAM algorithm has lower time complexity than an optimal RAM algorithm, it is because parallelism has been used. PRAM algorithms begin with only a single processor active and have two phases: in the first phase, a sufficient number of processors are activated; then the activated processors perform the computation in parallel. Given a single active processor, it is easy to see that ⌈log p⌉ activation steps are both necessary and sufficient for p processors to become active. Processor activation: exactly ⌈log p⌉ processor-activation steps are necessary and sufficient to change from 1 active processor to p active processors.

The Binary Tree Model. The binary tree is one of the most important paradigms of parallel computing. In some algorithms data flows top-down, from the root of the tree to the leaves; broadcast and divide-and-conquer algorithms both fit this model. In broadcast algorithms the root sends the same data to every leaf. In divide-and-conquer algorithms the tree represents the recursive subdivision of problems into subproblems. In other algorithms data flows bottom-up, from the leaves of the tree to the root; these are called fan-in or reduction operations.

Parallel Reduction. Given a set of n values a1, a2, ..., an and an associative binary operator ⊕, reduction is the process of computing a1 ⊕ a2 ⊕ ... ⊕ an. Parallel summation is an example of a reduction operation.

We represent each tree node with an element in an array. The mapping from the tree to the array is straightforward.

Sum of n values

Complexity: the spawn routine requires ⌈log(n/2)⌉ doubling steps. The sequential for loop executes ⌈log n⌉ times, and each iteration has constant time complexity. Hence the overall time complexity of the algorithm is Θ(log n), given n/2 processors.
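A sequential C sketch (ours) of the same tree-structured summation over the array representation; each pass of the outer loop corresponds to one parallel step that the n/2 PRAM processors would perform simultaneously:

```c
#include <stdio.h>

/* Tree-structured sum: in step s, element i accumulates element
   i + s. The log n passes of this loop correspond to the log n
   parallel steps of the PRAM algorithm. */
static int tree_sum(int a[], int n) {
    for (int s = 1; s < n; s *= 2)            /* log n passes */
        for (int i = 0; i + s < n; i += 2 * s)
            a[i] += a[i + s];                 /* done in parallel on a PRAM */
    return a[0];
}

int main(void) {
    int a[] = {3, 1, 4, 1, 5, 9, 2, 6};
    printf("sum = %d\n", tree_sum(a, 8));     /* prints 31 */
    return 0;
}
```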

Processor organization. Processor organizations represented by graphs: a processor organization can be represented by a graph in which the nodes (vertices) represent processors and the edges represent communication paths between pairs of processors. Processor organizations are evaluated according to criteria that help us understand their effectiveness in implementing efficient parallel algorithms on real hardware. These criteria are: diameter; bisection width; number of edges per node; maximum edge length.

Diameter. The diameter of a network is the largest distance between two nodes. Low diameter is better, because the diameter puts a lower bound on the complexity of parallel algorithms that require communication between arbitrary pairs of nodes.

Bisection width of the network. The bisection width of a network is the minimum number of edges that must be removed in order to divide the network into two halves (within one node). High bisection width is better, because in algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of the parallel algorithm.

Number of edges per node. It is best if the number of edges per node is a constant independent of the network size, because then the processor organization scales more easily to systems with large numbers of nodes. Maximum edge length. For scalability reasons, it is best if the nodes and edges of the network can be laid out in three-dimensional space so that the maximum edge length is a constant independent of the network size.

Introduction to the complexity of parallel algorithms. Running times. Types of analysis for algorithms: 1. Worst case (the usual one): T(n) = the maximum time needed for a problem of size n. 2. Average case: T(n) = the expected time for a problem of size n; a statistical distribution must be assumed. 3. Best case: T(n) = the minimum time for a problem of size n. This can be misleading, because it does not always occur.

The worst-case complexity of a RAM program is the function f(n), the maximum time the program takes to execute over all inputs of size n. The expected time complexity of a RAM program is the average time it takes to execute over the inputs of size n. There are two ways of measuring time in the RAM model: Uniform cost criterion: each RAM instruction requires one unit of time to execute and each register requires one unit of memory space; it is appropriate if the values manipulated by the program fit within one computer word. Logarithmic cost criterion: it takes into account that an actual memory word has a limited capacity.

Big O. The notation O(g(x)) is usually used to refer to the functions bounded above by the function g(x). The asymptotically tight bound (Θ notation) is related to the asymptotic upper and lower bounds (O and Ω notation): an asymptotic upper bound is a function that serves as an upper bound of another function when the argument tends to infinity.

Usual orders of functions. The orders most used in algorithm analysis, in increasing order, are the following (where c represents a constant): O(1), O(log n), O(n), O(n log n), O(n^2), O(n^c), O(c^n), O(n!).

Depending on the order of the function, it is said to be efficient or not. For a problem of size n, the order of the function that solves it can range from optimal (best case) or efficient down to inefficient (worst case).

DESIGN OF PARALLEL PROGRAMS

Single Memory Systems. The CHALLENGE/Onyx uses a high-speed system bus to connect all the components of the system. The memory has these features: there is a single address map, that is, the same word of memory has the same address in every CPU; there is no time penalty for communication between processes, because every memory word is accessible in the same amount of time from any CPU; all peripherals are equally accessible from any process; processes running in different CPUs can share memory and can update the identical memory locations concurrently.

To do this, processes map a single segment of memory into the virtual address spaces of two or more concurrent processes. Two processes can then transfer data at memory speeds, one putting the data into a mapped segment and the other taking the data out, and they can coordinate their access to the data using semaphores located in the shared segment.

MULTIPLE MEMORY SYSTEMS. There is no single address map: a word of memory in one node cannot be addressed at all from another node. There is a time penalty for some interprocess communication. Peripherals are accessible only in the node to which they are physically attached. The message-passing interface (MPI) is designed specifically for an application that executes concurrently in multiple nodes.

Models of parallel execution. Two features characterize the models for parallel programming:

Granularity: the relative size of the unit of computation that executes in parallel: a single statement, a function, or an entire process.

Communication channel: the basic mechanism by which the independent, concurrent units of the program exchange data and synchronize their activity.

Process-level parallelism. A UNIX process consists of: an address space; a large set of process state values; and one thread of execution. Interprocess communication (IPC) mechanisms can be used to exchange data and to coordinate the activities of multiple, asynchronous processes. In traditional UNIX practice, one process creates another with the system call fork(), which makes a duplicate of the calling process, after which the two copies execute in parallel. Typically the new process immediately uses the exec() function to load a new program.

Lightweight processes. A lightweight process shares some of its process state values with its parent process. It does not have its own address space; it continues to execute in the address space of the original process. A lightweight process differs from a thread in two significant ways: it has a full set of UNIX state values (some of these, for example the table of open file descriptors, can be shared with the parent process, but in general a lightweight process carries most of the state information of a process); and dispatch of lightweight processes is done in the kernel, with the same overhead as dispatching any process. The library support for statement-level parallelism is based on the use of lightweight processes.

Process creation. The process that creates another is called the parent process; the processes it creates are child processes; the parent and its children together are a share group. The fork() function is the traditional UNIX way of creating a process. The new process is a duplicate of the parent process, running in a duplicate of the parent's address space; both execute the identical program text. A parent process should not terminate while its child processes continue to run.

Process management. When the parent process has nothing to do after starting the child processes, it can loop on wait() until wait() reports that no more children exist, and then exit. Sometimes it is necessary to handle child termination while the parent cannot suspend; in this case the parent can treat the termination of a child process as an asynchronous event.
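A minimal sketch of the fork()/wait() pattern just described (illustrative, not from the original text):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Parent creates two children; each child does its share of
       the work, then exits. */
    for (int i = 0; i < 2; i++) {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); exit(1); }
        if (pid == 0) {                 /* child */
            printf("child %d (pid %d) working\n", i, (int)getpid());
            exit(0);
        }
    }
    /* Parent loops on wait() until no children remain. */
    while (wait(NULL) > 0)
        ;
    printf("parent: all children finished\n");
    return 0;
}
```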

Parallelism in real-time applications. In real-time programs such as aircraft or vehicle simulators, separate processes are used to divide the work of the simulation and distribute it onto multiple CPUs. In these applications, IRIX facilities (see the REACT Real-Time Programmer's Guide) can be used to: reserve one or more CPUs of a multiprocessor for exclusive use by the application; isolate the reserved CPUs from all interrupts; assign specific processes to execute on specific, reserved CPUs.

The Frame Scheduler seizes one or more CPUs of a multiprocessor, isolates them, and executes a specified set of processes on each CPU in strict rotation. The Frame Scheduler has much lower overhead than the normal IRIX scheduler, and it has features designed for real-time work, including detection of overrun (when a scheduled process does not complete its work in the necessary time) and underrun (when a scheduled process fails to execute in its turn).

Thread-level parallelism. A thread is an independent execution state within the context of a larger program; that is: a set of machine registers, a call stack, and the ability to execute code. A program can create many threads to execute in the same address space. There are two main reasons for using threads: portability and performance.

There are three key differences between a thread and a process. A UNIX process has its own set of UNIX state information, for example its own effective user ID and set of open file descriptors; threads exist within a process and do not have distinct copies of these state values; they share the single state belonging to their process. Each UNIX process has a unique address space that is accessible only to that process; threads within a process share the single address space belonging to their process. Processes are scheduled by the kernel, whereas threads are scheduled by code that operates in the user address space, without kernel assistance; thread scheduling can therefore be faster than process scheduling.

Threads: a thread takes relatively little time to create or destroy, compared to creating a lightweight process, and it shares all the resources and attributes of a single process (except for the signal mask). If you want each executing entity to have its own set of file descriptors, or to make sure that one entity cannot modify data shared with another entity, you must use lightweight processes or normal processes; threads cannot use those IPC mechanisms.

Threads can coordinate using these mechanisms: unnamed semaphores, for general coordination and resource management; message queues; mutex objects, which allow threads to gain exclusive use of a shared variable; condition variables, which allow a thread to wait while a controlling predicate is false; and semaphores, locks, and barriers to coordinate between multiple threads within a single program.
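As a concrete illustration, a minimal POSIX-threads sketch of creating threads in a shared address space (illustrative; the slides do not mandate a specific API):

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread runs in the same address space as its creator. */
static void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int id[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &id[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);   /* wait for both threads */
    return 0;
}
```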

Mutexes. A mutex is a software object that stands for the right to modify some shared variable, or the right to execute a critical section of code. A mutex can be owned by only one thread at a time; other threads trying to acquire it wait. When a thread wants to modify a variable that it shares with other threads, or to execute a critical section, the thread claims the associated mutex; this can cause the thread to wait until it can acquire the mutex. When the thread has finished using the shared variable or critical code, it releases the mutex. If two or more threads claim the mutex at once, one acquires the mutex and continues; the others are blocked until the mutex is released. A mutex has attributes that control its behavior.
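A minimal sketch of this claim/release discipline using a POSIX mutex (illustrative):

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;             /* the shared variable */

static void *add_many(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* claim: may block */
        counter++;                   /* critical section */
        pthread_mutex_unlock(&lock); /* release          */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, add_many, NULL);
    pthread_create(&b, NULL, add_many, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", counter);   /* always 200000 */
    return 0;
}
```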

Condition variables. A condition variable provides a way in which a thread can temporarily give up ownership of a mutex, wait for a condition to become true, and then reclaim ownership of the mutex, all in a single operation. Preparing condition variables: condition variables are supplied with a mechanism of attribute objects and with static and dynamic initializers. A condition variable must be initialized before use.

Using condition variables. A condition variable is a software object that represents a test of a Boolean condition. Typically the condition changes because of a software event such as "another thread has supplied needed data." A thread that wants to wait for that event claims the condition variable, which causes it to wait. The thread that recognizes the event signals the condition variable, releasing one or all of the threads that are waiting for the event.

A thread holds a mutex that represents a shared resource. While holding the mutex, the thread finds that the shared resource is not complete or not ready. The thread needs to do three things: give up the mutex so that some other thread can renew the shared resource; wait for the event "resource is now ready for use"; and re-acquire the mutex for the shared resource. These three actions are combined into one using a condition variable. When the event is signalled (or the time limit expires), the mutex is reacquired.
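A minimal sketch of those three combined actions using a POSIX condition variable (illustrative):

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int resource_ready = 0;       /* the controlling predicate */

static void *consumer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!resource_ready)          /* resource not ready yet */
        /* atomically: release mutex, wait, re-acquire mutex */
        pthread_cond_wait(&ready, &lock);
    printf("consumer: resource is ready\n");
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);
    pthread_mutex_lock(&lock);
    resource_ready = 1;              /* renew the shared resource */
    pthread_cond_signal(&ready);     /* wake one waiting thread   */
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}
```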

Statement-level parallelism. The finest level of granularity. Statement-level parallel support is based on the use of common variables in memory, and so it can be used only within the bounds of a single-memory system. The method for creating an optimized parallel program is as follows: write a complete application that runs on a single processor; completely debug and verify the correctness of the program in serial execution; apply the source analyzer; add assertions to the source program (these are not explicit commands to parallelize, but high-level statements that describe the program's use of data); run the program on a single-memory multiprocessor.

Models of distributed computing


MPI (Message-Passing Interface) model. MPI is a standard programming interface for the construction of a portable, parallel application in Fortran 77 or in C, especially when the application can be decomposed into a fixed number of processes operating in a fixed topology (for example, a pipeline, grid, or tree). PVM (Parallel Virtual Machine) model. PVM is an integrated set of software tools and libraries that emulates a general-purpose, flexible, heterogeneous, concurrent computing framework on interconnected computers of varied architecture. Using PVM, you can create a parallel application that executes as a set of concurrent processes on a set of computers that can include uniprocessors, multiprocessors, and nodes of array systems.

Each is a formal, abstract model for distributing a computation across the nodes of a multiple-memory system, without having to reflect the system configuration in the source code. Processes and threads allow you to execute in parallel within a single system memory. When the system memory is distributed among multiple independent machines, your program must be built around a message-passing model. In a message-passing model, your application consists of multiple, independent processes, each with its own address space, running on possibly many different computers. Each process shares data and coordinates with the others by passing messages. IRIX supports two libraries: the Message-Passing Interface (MPI) and the Parallel Virtual Machine (PVM).

Choosing between MPI and PVM. The MPI interface is the primary and preferred model for distributed applications. In many ways, MPI and PVM are similar: each is designed, specified, and implemented by third parties that have no direct interest in selling hardware; support for each is available over the Internet at low or no cost; each defines portable, high-level functions that are used by a group of processes to make contact and exchange data without having to be aware of the communication medium; each supports C and Fortran 77; each provides for automatic conversion between different representations of the same kind of data, so that processes can be distributed over a heterogeneous computer network.

MPI. The primary reason MPI is preferred is performance. The design of MPI is such that a highly optimized implementation can be created for a homogeneous environment, and MPI applications take advantage of this to exchange data with small latencies and high data rates. Another difference between MPI and PVM is in the support for the "topology" (the interconnect pattern: grid, torus, or tree) of the communicating processes: in MPI, the group size and topology are fixed when the group is created, which permits low-overhead group operations; in PVM, group composition is dynamic, which causes more overhead in common group-related operations.

Converting a PVM program into an MPI program. To a large extent, the library calls of MPI and PVM provide similar functionality, but some PVM calls do not have a counterpart in MPI, and vice versa, and the semantics of some of the equivalent calls are inherently different for the two libraries. The process of converting a program can therefore be complicated, depending on the particular PVM calls and how they are used. PVM includes a console, which is useful for monitoring and controlling the states of the machines in the virtual machine and the state of execution of a PVM job. The MPI standard does not provide mechanisms for specifying the initial allocation of processes to an MPI computation or their binding to physical processors.

The differences between PVM and MPI. The chief differences between the current versions of the PVM and MPI libraries are as follows: PVM supports dynamic creation of tasks, whereas MPI does not. PVM supports dynamic process groups, that is, groups whose membership can change dynamically at any time during a computation; MPI does not support dynamic process groups. The chief difference between PVM groups and MPI communicators is that any PVM task can join or leave a group independently, whereas in MPI all communicator operations are collective. A PVM task can add or delete a host from the virtual machine, thereby dynamically changing the number of machines a program runs on; this is not available in MPI.

The differences between PVM and MPI (2). PVM provides two methods of signaling other PVM tasks: sending a UNIX signal to another task, and notifying a task about an event by sending it a message with a user-specified tag that the application can check; these functions are not available in MPI. A task can enter and leave a PVM session as many times as it wants, whereas an MPI task must initialize and finalize exactly once. A PVM task can be registered by another task as responsible for adding new PVM hosts, as a PVM resource manager, or as responsible for starting new PVM tasks; these features are not available in MPI. A PVM task can multicast data to a set of tasks; as opposed to a broadcast, this multicast does not require the participating tasks to be members of a group. MPI does not have a routine to do multicasts.

On the other hand, MPI provides several features that are not available in PVM, including a variety of communication modes, communicators, derived data types, additional group-management facilities, virtual process topologies, and a larger set of collective communication calls.

Programming models supported by PVM. Pure SPMD model: in the SPMD program model, n instances of the same program are started as the n tasks of the parallel job, using the spawn command (or by hand at each of the n hosts simultaneously). No tasks are dynamically spawned within the tasks. This scenario is essentially the same as the current MPI one, where no tasks are dynamically spawned. General SPMD model: in this model, n instances of the same program are executed as the n tasks of the parallel job; one or more tasks are started at the beginning, and these dynamically spawn the remaining tasks in turn.

MPMD model. In an MPMD programming model, one or more distinct tasks (having different executables) are started by hand, and these tasks dynamically spawn other (possibly distinct) tasks.

Control of parallelism. Segments. Memory is manipulated through its memory locations, but its addressing can be: linear (direct); segmented (data, code, stack, and extra segments); with global and local descriptors (which indicate locations shared by different segments); paged (linear address + offset + page table + page-table directory); or a combination of segmentation and paging (virtual memory and protection levels: kernel, 0; system services, 1; OS extensions, 2; applications, 3). The kernel has the maximum privilege and the greatest protection. Exceptions (instructions) and interrupts (devices).

Processes MPI is the best example of controlling parallelism by means of processes. Semaphores Software objects used to coordinate access to counted resources. Readers and writers Software objects that set and interpret a flag in each process. Critical sections Bottlenecks. Synchronization A convenient way to synchronize parallel processes on multiprocessor systems is barriers.
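A minimal sketch of barrier synchronization with MPI: no process passes MPI_Barrier until every process in the communicator has reached it.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        printf("process %d: before the barrier\n", rank);
        MPI_Barrier(MPI_COMM_WORLD);   /* everyone waits here for everyone else */
        printf("process %d: after the barrier\n", rank);

        MPI_Finalize();
        return 0;
    }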

Interprocess communication Types of interprocess communication: sharing memory between processes (shared memory); mutual exclusion, including semaphores, locks, and the like; event signaling; message queues (of which there are two varieties); and file and record locking. Interprocess communication (IPC) is any coordination of the actions of multiple processes, for example sending data from one process to another. Implementations of a given mechanism should not be mixed within a single program; doing so can produce unpredictable results.
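As one concrete example, here is a minimal sketch of a POSIX message queue shared between a parent and a child process ("/demo_q" is a hypothetical queue name; on Linux, link with -lrt):

    #include <fcntl.h>
    #include <mqueue.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        struct mq_attr attr = { .mq_maxmsg = 4, .mq_msgsize = 64 };
        mqd_t q = mq_open("/demo_q", O_CREAT | O_RDWR, 0600, &attr);

        if (fork() == 0) {                        /* child: the sender */
            mq_send(q, "hello from child", 17, 0);
            return 0;
        }
        char buf[64];                             /* must be >= mq_msgsize */
        mq_receive(q, buf, sizeof buf, NULL);     /* parent: the receiver  */
        printf("parent received: %s\n", buf);

        wait(NULL);
        mq_unlink("/demo_q");
        return 0;
    }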

TRENDS IN PARALLEL COMPUTING THE GRID PROJECT Grid.org is a single destination site for large-scale, non-profit research projects of global significance. With the participation of over 3 million devices worldwide, grid.org projects like Cancer Research, Anthrax Research, Smallpox Research and the new Human Proteome Folding Project (running in conjunction with IBM's new World Community Grid) have achieved record levels of processing speed and success. Grid.org projects are powered by United Devices' Grid MP technology, the leading solution for commercial enterprise grid deployments.

The basics Grid computing is a form of distributed computing that involves coordinating and sharing computing, application, data, storage, or network resources across dynamic and geographically dispersed organizations. Grid technologies promise to change the way organizations tackle complex computational problems. However, the vision of large-scale resource sharing is not yet a reality in many areas; grid computing is an evolving area of computing, where standards and technology are still being developed to enable this new paradigm.

Why is it important? Time and money. Organizations that depend on access to computational power to advance their business objectives often sacrifice or scale back new projects, design ideas, or innovations due to sheer lack of computational bandwidth. Project demands simply outstrip computational power, even if an organization has significant investments in dedicated computing resources. Even given the potential financial rewards of additional computational access, many enterprises struggle to balance the need for additional computing resources with the need to control costs. Upgrading and purchasing new hardware is a costly proposition, and with the rate of technology obsolescence, it is eventually a losing one. By better utilizing and distributing existing compute resources, grid computing will help alleviate this problem.

Delivering grid benefits today

Many companies want to take advantage of the cost and efficiency benefits that come from a grid infrastructure today, without being locked into a system that will not grow with their needs. To provide customers with the solution they need, United Devices tackled the complex security, scalability, and unobtrusiveness issues required for a superior enterprise grid, while building towards the open standards of the GGF (Global Grid Forum). By embracing these standards, United Devices lets its customers move toward compatibility with future grid technologies and adopt upcoming technologies as they are developed, while delivering the promises and benefits of the grid today.

The Grid MP platform by United Devices works by amalgamating the underutilized IT resources on a corporate network into a powerful enterprise grid that can be shared by groups across the organization, even geographically disparate ones. Desktop PCs, the most common corporate technology asset, are also the most underutilized, often using only 10% of their total compute power even when actively engaged in their primary business functions. By harnessing these plentiful underused computing assets and leveraging them for revenue-driving projects, the Grid MP platform provides immediate value for companies that want to move forward with their grid strategies without limiting any future grid developments.

The benefits of building an enterprise grid with the Grid MP platform include: Lower Computing Costs On a price-to-performance basis, the Grid MP platform gets more work done with less administration and budget than dedicated hardware solutions. Depending on the size of your network, the price-to-performance ratio for computing power can literally improve by an order of magnitude.

Faster Project Results

The extra power generated by the Grid MP platform can directly impact an organization's ability to win in the marketplace by shortening product development cycles and accelerating research and development processes.

Better Product Results

Increased, affordable computing power means not having to ignore promising avenues or solutions because of a limited budget or schedule. The power created by the Grid MP platform can help ensure a higher-quality product by allowing higher-resolution testing and results, and can permit an organization to test more extensively prior to product release.
Linux and Beowulf Linux is an open-source, freely available Unix-like kernel. Linux distributions include free compilers, debuggers, editors, text processors, WWW servers, mail servers, SQL servers, and a whole range of useful tools.

Beowulf is a project that produced parallel Linux clusters from off-the-shelf hardware and freely available software: o inexpensive Intel x86-based boxes (PCs) o one or more networking methods (10/100 Ethernet, Myrinet, ATM, etc.) o fast network connectivity (hub, switch, gigabit switch, or other) o the Linux operating system (a freely available Unix clone, including full source code) o cc, f77, vi, emacs, perl, python, and other free compilers and tools o free PVM or MPI implementations (MPICH, LAM), etc.; commercial HPF implementations are also available. The result is a PC-cluster-style supercomputer, but much cheaper.
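As a brief usage sketch, assuming an MPICH installation on the cluster's front-end node: an MPI program such as the SPMD skeleton shown earlier would typically be compiled with "mpicc -o hello hello.c" and launched across the nodes with "mpirun -np 4 ./hello". The exact commands and host-file configuration vary between MPICH and LAM.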

Tools for Parallel Programming Introduction to Linux clusters The special-effects company Weta Digital used a cluster of 200 Linux servers to produce computer animation for the film The Lord of the Rings. Weta Digital also collaborated on other projects such as Shrek and Titanic. The advantage of using Linux is its low cost and its open-source code, which allows the system to be adapted to technical needs. A cluster is a set of physically separate processors (not in the same cabinet) under a single operating system that work together on the same application.

Processor farm (FARM) and a satellite image processed in parallel.

Below are other applications in which clusters running the Linux operating system were used.

3-D simulation plot of space shuttle streamlining

Airflow design for performance and fuel efficiency

Das könnte Ihnen auch gefallen