George Michael
xeirwn@cs.ucy.ac.cy
Department of Computer Science, University of Cyprus, Nicosia, Cyprus
Keywords: Multithreading, Blocked, Interleaved, Simultaneous, Implicit, Explicit
Abstract
Multithreading is a programming and execution model that allows multiple threads to co-exist in the context of the same processor, and aims to exploit the large number of resources available on the chip both to improve performance and to reduce energy consumption. This survey deals with multithreading in general: what it is, the computer architectures that support it, and other methods that can be applied to produce it using software- or hardware-based synchronization units. Along with the various technologies used to execute multi-threaded applications, we will also cover some implementations of multithreading technologies. We will analyze the general types of multithreading, which can be categorised into the following execution types: Explicit (blocked, non-blocking, interleaved, simultaneous) and Implicit. We will also present the basic differences between each type, and we will show both hardware and software implementations that we found in the papers we read.

1 Introduction

1.1 What is multithreading

Multithreading may refer to two concepts. First, it may be the ability of a single core to handle more than one thread at the same time, switching between them whenever one of them performs I/O operations and stalls. This can be achieved with additional hardware on the chip that allows it to hold the states of multiple threads at the same time and to take advantage of instruction-level parallelism. Second, it may refer to the breakdown of one or more applications into threads, which are then scheduled to execute on as many available cores as possible without destroying the application logic and validity. Thread size can vary depending both on the hardware and the application, and can range from a single instruction to a whole section of the application.

… single chip, and performance can be achieved without making the circuits smaller, but at the cost of making resource-utilization management more complicated. Multithreading is an attempt to achieve a fairly good utilization of such systems and to provide a slightly more flexible way of distributing workload on the chip. Nowadays most of the processors available to the public are either multithreading or, even better, both multithreading and multicore, which leads to the necessity of improving their usability.

1.3 What is available out there

At the time of writing, there are several different implementations of multithreading on the market that are able to handle two or more threads of control in parallel inside the pipeline. The implementations of multithreaded processors fall under two main categories [4]:

1.3.1 Explicit Multithreaded Processors

Explicit multithreading architectures refer to architectures that can execute threads of one or more processes concurrently. These architectures aim at boosting the total performance of a multiprogramming workload consisting of many processes; this implies that, although the total performance of the system improves, the performance of a single thread might not improve accordingly and might even degrade. Under explicit multithreading we classify the following types, which we will study later on: interleaved, blocked, non-blocking and simultaneous multithreading architectures.
1.3.2 Implicit Multithreaded Processors

… can cause the clock frequency to be slower, due to hardware changes, and thus degrade performance even in cases of a single thread.

… from the University of Cyprus; they are called TFlux [3] and DDMVM [1], and both are based on the DDM model:
1.4.1 DDM: Data-Driven Multithreading

DDM is a model based on the data-flow model; it differs from data-flow mainly in its much coarser granularity: instead of scheduling a single instruction, it schedules a group of instructions. Also, DDM uses data-driven caching policies to implement deterministic data prefetching in order to improve locality.
1.4.2 DDMVM: Data-Driven Multithreading Virtual Machine

2.1 Cooperative Multithreading

Relies on the threads themselves to relinquish control once they are at a stopping point. This can create problems if a thread is waiting for a resource to become available.
3 Multithreading Taxonomy
Interleaved Multithreading
Simultaneous Multithreading
For the second part of this survey, which will take place after delivering this part, we will implement a couple of applications using TFlux and DDMVM (described later) to get a grip on the advantages and disadvantages of modern programming languages. The following section is from [2].
Data-Driven Multithreading
This model is based on the Data-flow model, where an
instruction is scheduled for execution when all the data
it needs are available. The DDM model applies the
same principle but on a larger granularity. Instead of
scheduling a single instruction, it schedules one thread (a
group of instructions) when all its input data have been
produced and placed in the processor's local memory. Thus, neither synchronization nor communication latencies are experienced once the execution of the thread starts.
DDM decouples the synchronization and computation
parts of a program and overlaps them to tolerate synchronization and communication latencies. DDM utilizes
data-driven caching policies to implement deterministic
data prefetching which improves the locality of sequential
processing. The core of the DDM implementation is the
Thread Synchronization Unit (TSU) which is responsible
for the scheduling of threads at run-time based on data
availability. A DDM program consists of several threads
that have producer-consumer relationships and are
grouped into DDM Blocks. A DDM block is equivalent
to a loop or a function in high level languages. The TSU
schedules a thread to run only after all the producers
of this thread have completed execution, which ensures
that all the data this thread needs is available. Once the
execution of a thread starts, instructions within a thread
are fetched by the CPU sequentially in control-flow
order, thus exploiting any optimization available by the
CPU hardware. Each thread is identified by a tuple: a ThreadID, which is static, and a Context, which is dynamic.
Each thread is paired with its synchronization template
or meta-data specifying the following attributes:
… any details of the underlying system. It provides a runtime support that is built on top of a commodity operating system, and a preprocessor tool along with a set of simple compiler directives that allows the user to easily develop DDM programs. TFlux targeted simulated homogeneous multi-cores with a hardware TSU, and focused on developing a portable software platform that could run on commercial multi-core systems utilizing a software-implemented
TSU. The speedup achieved by the TFlux platform was
close to linear. Moreover, the speedup was stable across
the different targeted platforms thus allowing the benefits
of DDM to be exploited on different commodity systems.
// Libraries
#include <stdio.h>

// Main Program
int main(int argc, char *argv[])
{
    int x, y, z, k;

    #pragma ddm kernel 2
    #pragma ddm startprogram
    x = 4;
    y = 8;

    // Block 1
    #pragma ddm block 1 import(int x, int y, int z) export(x:4, y:5, z:3)
    /*
     * Import: defines that variable x should be imported from
     * DThread 0. This means that the variable x has been modified
     * not by a DThread of this block but rather by code outside
     * the block.
     * Export: due to the fact that variable x is used by code
     * that follows DThread 1.
     */
    #pragma ddm thread 1 kernel 1 import(x:0) export(int x)
    x++;
    #pragma ddm endthread

    #pragma ddm thread 2 kernel 2 import(y:0) export(int y)
    y++;
    #pragma ddm endthread

    // Defining two DThreads on the same kernel will serialize their execution
    #pragma ddm thread 3 kernel 2 import(x:1, y:2) export(int z)
    z = x + y;
    #pragma ddm endthread

    #pragma ddm thread 4 kernel 1 import(z:3, x:1) export(int x)
    x *= z;
    #pragma ddm endthread

    #pragma ddm thread 5 kernel 2 import(z:3, y:2) export(int y)
    y *= z;
    #pragma ddm endthread

    // End of Block 1
    #pragma ddm endblock

    // Print results
    printf("x: %d\ny: %d\nz: %d\n", x, y, z);
    #pragma ddm endprogram
    return 0;
}
4.2 DDMVM

DDM-VM programs consist of two parts: the code of the DDM threads and the synchronization graph describing the consumer-producer dependencies amongst the threads.
The DDM-VM architecture currently has two implementations: the Data-Driven Multithreading Virtual Machine for the Cell (DDM-VMc) and the Data-Driven Multithreading Virtual Machine for Symmetric Multi-cores (DDM-VMs). In the next section we give an overview of the DDM-VMc, and the rest of this thesis focuses on the specification, implementation and evaluation of the DDM-VMs.
int main(int argc, char **argv)
{
    // Initialize the DDM runtime and the application data
    memset(&global_info, 0, sizeof(global_info));
    INIT_DDMSYSTEM(&global_info);
    InitializeData();
    my_id = GetNodeId();
    int short ConsList[3] = {3, 2, 5};
    SystemSync();

    // Register the thread templates (dependencies and scheduling modes)
    DVM_SET_THREAD_TEMPLATE(THREAD_2,
        1, 1, THREAD_3, (int short *)0, 0, DVM_MODULAR,
        MASK_INDX, DVM_ARITY_1, SM_DEFAULT, 0);
    DVM_SET_THREAD_TEMPLATE(THREAD_2,
        1, 1, THREAD_3, (int short *)0, 0, DVM_MODULAR,
        MASK_INDX, DVM_ARITY_1, SM_PERFECT,
        S_CreateContext(make_mask(DIM*DIM-1)));
    DVM_SET_THREAD_TEMPLATE(THREAD_3,
        1, 3, 0, ConsList, 3, DVM_MODULAR,
        MASK_CNTX, DVM_ARITY_2, SM_DEFAULT, 0);

    // Run the DDM program
    DVM_EXECUTE();

    // Collect and verify the results, then shut the system down
    if (GetNodesCount() > 1)
        gather_data(0);
    if (my_id == 0) {
        if (verifyData())
            printf("@Total %f\n", elapsed_time);
        else
            printf("@Incorrect\t%f\n", elapsed_time);
    }
    SHUTDOWN_DDMSYSTEM();
    NIUstop();
    return 1;
}
DVM_CACHEFLOW_DFPL_START();
int indx, cntx;
int rowA, rowB;
int colA, colB;
g_address g;
g.node = GetNodeId();

DVM_START_DFPL(THREAD_3)
    GET_CONTEXT_D(In->Context, cntx, indx);
    rowA = cntx / DIM;
    colA = indx;
    rowB = indx;
    colB = cntx % DIM;
    g.lAddress = (long)A[(rowA*DIM + colA)];
    DVM_SET_DFP(g, BSIZE*BSIZE*sizeof(DATA_TYPE),
        CMD_MODE_READ | CMD_MODE_REUSE | CMD_MODE_WRITE);
DVM_END_DFPL()
DVM_CACHEFLOW_DFPL_END();
int threads_main(unsigned int speid)
{
    INIT_RUNTIME(speid);
    int i, j, k;
    int indx, cntx;
    DVM_SET_IPF(THREAD_2);
    DVM_SET_IPF(THREAD_3);
    DVM_THREAD_END();
}
Summary