
Multi-threading Computer Architectures

George Michael
xeirwn@cs.ucy.ac.cy
Department of Computer Science, University of Cyprus, Nicosia, Cyprus
Keywords: Multithreading, Blocked, Interleaved, Simultaneous, Implicit, Explicit
Abstract

Multithreading is a programming and execution model that allows multiple threads to co-exist in the context of the same processor and aims to exploit the large number of resources available on the chip, both to improve performance and to reduce energy consumption. This survey deals with multithreading in general: what it is, the computer architectures that support it, and other methods that can be applied to produce it using software- or hardware-based synchronization units. Along with the various technologies used to execute multi-threaded applications, we will also cover some implementations of multi-threading technologies. We will analyze the general types of multithreading, which can be categorised into the following execution types: Explicit (Blocked, Non-Blocking, Interleaved, Simultaneous) and Implicit. We will also present the basic differences between each type, and we will show both hardware and software implementations that we found in the papers we read.

1 Introduction

1.1 What is multithreading

Multithreading may refer to two concepts. First, it may be the ability of a single core to handle more than one thread at the same time, switching between them whenever one of them performs I/O operations and stalls. This can be achieved with additional hardware on the chip that allows it to hold the states of multiple threads at the same time, and by taking advantage of instruction level parallelism. Second, it may refer to the breaking down of one or more applications into threads, which are then scheduled to execute on as many available cores as possible without destroying the application logic and validity. Thread size can vary depending both on the hardware and the application, and can range from a single instruction to a whole section of the application.
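As a minimal illustration of this second meaning, we sketch below how one computation can be broken into threads without changing the application logic. The sketch is our own illustration, not taken from the surveyed papers; it uses POSIX threads, and the harmonic-sum workload and the four-way split are arbitrary choices.

/* Sketch: decomposing one application into NTHREADS threads.
 * Each thread works on its own slice, so no synchronization
 * beyond the final join is needed. Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double partial[NTHREADS];

/* Each thread sums one contiguous slice of the index range. */
static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += 1.0 / (i + 1);
    partial[id] = s;   /* no sharing: each thread owns one slot */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    double sum = 0.0;
    for (long id = 0; id < NTHREADS; id++) {
        pthread_join(t[id], NULL);
        sum += partial[id];
    }
    printf("sum = %f\n", sum);
    return 0;
}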

1.2 Why we need this

Everything has a limit. In previous years, developers were trying to improve the performance of their systems by using methods like increasing the clock speed, enlarging the cache sizes, and making more powerful processors by adding extra hardware to them or hardcoding onto them optimizations for specific commands, thus making them a lot more complex. These techniques were able to sustain the increasing trend of performance for a certain period, but right after that only problems arose: they led us into bumping into several technological walls. One example is the frequency wall, since frequency could not scale any more: the energy consumption of those systems was too great and reached the limits of the hardware, the heat generated by them was relatively large, and the raw material of the chip would not be able to keep up. Another example is the Instruction Level Parallelism wall, which in simple terms states that we can take advantage of ILP only up to a certain degree; we cannot keep on building complex hardware that will handle multiple scenarios of incoming instructions just to get minor optimizations, nor can we predict all cases and handle them properly in accordance with their context. These walls, and more, led to multi-core systems, where more than one core can be found on a single chip and performance can be gained without making the circuits smaller, but at the cost of making resource utilization management more complicated. Multithreading is an attempt to achieve fairly good utilization of such systems and to provide a slightly more flexible way of distributing workload on the chip. Nowadays most of the processors available to the public are either multithreading or, even better, both multithreading and multicore, which leads to the necessity of improving their usability.

1.3 What is available out there

As we speak, there are several different implementations of multithreading in the market that are able to handle two or more control threads in parallel inside the pipeline. The implementations that deal with multithreaded processors fall under two main categories [4]:

1.3.1 Explicit Multithreaded Processors

Explicit Multithreading architectures refer to architectures that can execute threads of one or more processes concurrently. These architectures aim at boosting the total performance of a multiprogramming workload used by many processes, which implies that, although the total performance of the system improves, the performance of a single thread might not see the same benefit and might even degrade. As Explicit Multithreading architectures we can count the following types, which we will study later on: interleaved, blocked, non-blocking and simultaneous multithreading architectures.

1.3.2 Implicit Multithreaded Processors

Implicit architectures refer to architectures that are able to execute several threads from a single sequential program concurrently. They take advantage of Instruction Level Parallelism and, because we cannot afford to have specialised hardware to handle all cases, ILP can be improved with the help of a compiler as well. Implicit multithreading can increase the performance of a single program thread, but it might affect and degrade the performance of a multiprogramming environment.

1.4 What will become available out there

Many research groups around the world are now working on different approaches to properly taking advantage of multithreaded/multicore architectures using data-flow techniques. Data-Flow is a functional and asynchronous execution model where, instead of following the control flow of an application in order to decide which thread to issue, we break up the process into chunks that we handle with a Synchronization Unit (in classic data-flow, each chunk is a single instruction). Chunks require certain data as input, from other threads or from memory, in order to operate, and produce data for semantically following chunks. This way we change the order of the chunks and execute the ones that have their incoming data ready (whether it comes from other threads or from memory), thus overlapping I/O and data-dependency stalls. In this survey, we will tackle a couple of software-based implementations whose semantics are good enough for them to be ported to hardware. Both implementations are from the University of Cyprus; they are called TFlux [3] and DDMVM [1] and are based on the DDM model:
1.4.1 DDM: Data-Driven Multithreading

DDM is a model based on the data-flow model; it differs from data-flow mostly by having much larger chunks: instead of one instruction, each chunk contains multiple instructions. Also, DDM uses data-driven caching policies to implement deterministic data prefetching in order to improve locality.

1.4.2 TFlux: Thread Flux

TFlux is a platform that is able to implement the DDM model of execution without taking into consideration the architecture of the host machine (besides the software version, there is also a Simics-based hardware version). It virtualizes the details of the underlying system and is built on top of a commodity operating system. It provides runtime support by using a preprocessor tool along with a set of simple compiler directives that allows the user to easily develop DDM programs.

1.4.3 DDMVM: Data-Driven Multithreading Virtual Machine
[2] DDMVM is a virtual machine that supports the DDM execution model on homogeneous and heterogeneous multi-core systems and is implemented as a software module. DDMVM virtualizes the parallel resources of the underlying machine and uses a unified representation for DDM programs. With the use of a set of C macros it can identify threads, the producer-consumer relationships amongst the threads, and the data produced and consumed by each thread.

1.5 Why we are doing this survey

Our goal is to study multithreading, understand what it is and how it is classified, and learn about and use state-of-the-art implementations supporting it, as it is a major contributor to solving the increasing performance issues present at this time.

2 Multithreading

Multithreading is a programming and execution model that takes advantage of the multiple threads that coexist and can be derived from within the context of a single process to improve performance. Although these threads share the same process resources, they should be able to execute independently in order to be beneficial for the performance of the process (the above can also be applied to larger-grain instruction blocks). This particular characteristic that some processes have allows them to operate faster on computer systems that have multiple cores as well. A root of multithreading is the dataflow model: combining instruction-level context switching with sequential scheduling can be perceived as a hybrid dataflow architecture.

Advantages

- Does not stall the whole process whenever a thread has many cache misses (or other I/O or data dependencies), as long as there are sufficient threads available to execute.
- Supports multiple threads on the same pipeline and thus increases utilization of the hardware.
- Allows threads to share memory, since they run on the same hardware resources, and increases data locality.

Disadvantages

- Can lead to race conditions if not properly controlled; might need semaphores (see the sketch after this list).
- Non-intuitive behaviors of the process might arise and require synchronization.
- In multiprogramming environments it can lead to degradation of the performance of a single process.
- Can cause the frequency to be lower, due to the hardware changes, and thus degrade performance even in the single-thread case.
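To make the first disadvantage concrete, here is a minimal sketch, our own illustration rather than an example from the surveyed papers, of two threads incrementing a shared counter. Without the lock (a mutex here, standing in for the semaphore mentioned above), the two read-modify-write sequences can interleave and lose updates.

/* Sketch: a classic race on a shared counter, fixed with a mutex.
 * Remove the lock/unlock pair and the result becomes nondeterministic. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);   /* without this pair, updates race */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, increment, NULL);
    pthread_create(&b, NULL, increment, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", counter);  /* with the mutex: always 2000000 */
    return 0;
}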
2.1 Thread Scheduling Methods

We will call the scheduling method the method that is used to determine when a context switch should occur:

Preemptive Multithreading
Allows the operating system to determine when a context switch should occur. The system may make a context switch at an inappropriate time, causing lock convoys, priority inversion or other negative effects.

Cooperative Multithreading
Relies on the threads themselves to relinquish control once they are at a stopping point. This can create problems if a thread is waiting for a resource to become available. A minimal sketch of the cooperative style follows.
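The following sketch, again our own illustration, shows the cooperative style: each thread voluntarily yields the processor at its stopping points, so a thread that never yields would starve the others. We assume POSIX sched_yield; on a mainstream preemptive OS this call is only a hint to the kernel scheduler, but it illustrates the idea of threads relinquishing control themselves.

/* Sketch: cooperative scheduling, where each worker gives up the
 * CPU at an explicit stopping point instead of being preempted. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    const char *name = (const char *)arg;
    for (int step = 0; step < 5; step++) {
        printf("%s: step %d\n", name, step);  /* one unit of work */
        sched_yield();  /* cooperative stopping point: relinquish the CPU */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, "A");
    pthread_create(&b, NULL, worker, "B");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}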

3 Multithreading Taxonomy

We will separate the multithreading architectures into the categories explicit and implicit, where their main difference is that implicit multithreading handles threads that derive from only one single sequential process, while explicit multithreading can handle concurrent threads from one or more processes.

3.1 Explicit Multithreading

Explicit Multithreading refers to architectures that can execute multiple threads of one or more processes concurrently. The goal of these architectures is to provide an increase in the total performance of a multiprogramming workload used by many processes (they could degrade the performance of a single thread as a tradeoff). Explicit Multithreading can be broken down into two categories [4], depending on whether the processor is:

- Issuing from a single thread in a cycle.
- Issuing from multiple threads in a cycle.

We are going to present the basic implementations of these categories.

Issuing from a single thread in a cycle

3.1.1 Interleaved Multithreading

[4] On every processor cycle, an instruction of another thread is fetched and added to the execution pipeline, causing a thread context switch; this rightfully gives it the name fine-grain multithreading. To perform well it needs at least as many available threads as there are pipeline stages, and it does not allow issuing a second instruction from a thread that is already in the pipeline until the first one retires. These two characteristics can cause degradation of single-thread performance. In general, what it achieves is that it eliminates control and data dependencies in the pipeline, allowing the pipeline design to be simpler. Scalar architectures can benefit from this architecture, since they can spawn multiple threads that run concurrently. A minimal sketch of the fetch policy follows.
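As a rough illustration of the interleaved (fine-grain) fetch policy described above, this sketch, our own simplification with placeholder thread counts and pipeline depth, selects a different ready thread every cycle, round-robin, and never fetches from a thread that already has an instruction in flight:

/* Sketch: round-robin fetch of one instruction per cycle,
 * skipping threads whose previous instruction has not retired. */
#include <stdio.h>

#define NTHREADS 8   /* at least as many threads as pipeline stages */
#define DEPTH    8   /* an instruction retires DEPTH cycles after fetch */

int main(void) {
    int retire_at[NTHREADS] = {0};  /* cycle when each thread's in-flight
                                       instruction retires */
    int last = -1;
    for (int cycle = 0; cycle < 24; cycle++) {
        int pick = -1;
        /* round-robin over the threads, skipping any thread that
         * still has an instruction in the pipeline */
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (retire_at[t] <= cycle) { pick = t; break; }
        }
        if (pick >= 0) {
            retire_at[pick] = cycle + DEPTH;  /* one instruction enters */
            printf("cycle %2d: fetch from thread %d\n", cycle, pick);
            last = pick;
        } else {
            printf("cycle %2d: bubble (no ready thread)\n", cycle);
        }
    }
    return 0;
}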
3.1.2 Non-Blocking Multithreading

The process is broken down into relatively small chunks of instructions that are connected/controlled via producer-consumer relations. In each chunk, the instructions are ordered sequentially, and they should have neither remote memory accesses nor synchronization waits. A thread is issued only when all its dependencies are met (it has all its input data available), which allows it to execute until completion without interruption. It is also known as Threaded Data-Flow or Large-Grain Data-Flow.

3.1.3 Blocked Multithreading

[4] A single thread executes alone on the pipeline until it reaches a situation that triggers a context switch. At that point, another thread will start executing on the pipeline, having it exclusively allocated. A blocked thread can resume only after the blocking reason is cleared. It is also called Coarse-Grain Multithreading because, with respect to Interleaved Multithreading, it needs fewer threads to saturate the pipeline, due to the fact that it has multiple instructions in the pipeline per thread instead of just one at any time.
Models of Blocked Multithreading

- Static: Context switches are caused by special instructions generated by the compiler, or by assigning instructions to classes and switching on the classes we believe will create long delays. This allows us to hide the latency of the fetch stage by switching early, but it is not effective when the data is already in the cache.
- Dynamic: The context switch occurs when an instruction in the pipeline causes a long-latency event. It has the advantage of switching only on actual cache misses, but the miss is only discovered late in the pipeline, causing many subsequent instructions that are already in the pipeline to be discarded as well.

A minimal sketch of the dynamic trigger policy follows this list.
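The sketch below is our own illustration of the dynamic model; the random event function is a placeholder for a real cache-miss detector. One thread keeps the pipeline to itself until it hits a long-latency event, and only then does the context switch to the next thread:

/* Sketch: coarse-grain (blocked) multithreading with a dynamic
 * switch trigger: run one thread until a long-latency event. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4

/* Placeholder for detecting a long-latency event (e.g. a cache
 * miss); a real pipeline only learns this late, in the memory stage. */
static bool long_latency_event(void) {
    return rand() % 8 == 0;   /* roughly 1 in 8 instructions misses */
}

int main(void) {
    int current = 0;
    for (int cycle = 0; cycle < 32; cycle++) {
        /* the current thread issues instructions cycle after cycle ... */
        printf("cycle %2d: thread %d issues\n", cycle, current);
        if (long_latency_event()) {
            /* ... until an event blocks it; switch to the next thread.
             * The blocked thread's instructions already in the pipeline
             * behind the miss would be squashed at this point. */
            int next = (current + 1) % NTHREADS;
            printf("cycle %2d: thread %d blocked, switch to %d\n",
                   cycle, current, next);
            current = next;
        }
    }
    return 0;
}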
Issuing from multiple threads in a cycle

3.1.4 Simultaneous Multithreading

[4] Combines multithreading with superscalar techniques. It requires several hardware contexts and register sets on the processor and allows issuing instructions from several threads simultaneously. It tolerates the execution latencies of one thread by issuing instructions from other threads. SMT is more efficient with a superscalar architecture, as it takes advantage of both Instruction Level Parallelism and Thread Level Parallelism. It has better performance for multithreaded/multi-program applications, since it takes advantage of the independence between threads of the same application or threads of different applications.

Simultaneous Multithreading Implementations

- Resource Sharing: Instructions from different threads share most of the resources; minimal hardware complexity is added to conventional superscalars, and tags are used to distinguish between instances of threads.
- Resource Replication: Replicates all internal resources of the superscalar, binds a buffer to each thread, and can issue instructions from different instruction windows simultaneously.
3.2 Implicit Multithreading

[4] Speculating at the thread level, Implicit Multithreading tries to concurrently execute several threads derived from a single sequential program. It dynamically generates threads and executes them speculatively. It can be used to increase the performance of a single-thread process. These processors are closer to CMPs, as many of them split execution into several closely coupled execution units, with each execution unit responsible for its own thread's execution.

3.3 Implicit Multithreading Examples

- Multiscalar Processors - A collection of execution units and a hardware sequencer that assigns tasks to the units speculatively. Supports both control and data speculation.
- Trace Processors - Break the process into multiple traces (a trace is a dynamic sequence of instructions captured dynamically by the hardware and contained in a specialized trace cache) that are executed on multiple cores. One core executes the current trace while the others execute future traces speculatively.
- Speculative Multithreaded Processor - Uses control speculation both amongst threads and inside threads (conventional speculation). It dynamically extracts threads from loops speculatively, without intervention.
- Speculative Data-Driven Multithreading - Identifies instructions that can introduce latency or pipeline delays as critical instructions. Such instructions are speculatively forked to execute separately alongside the main thread, thus tolerating the latency they cause.

4 Data-Driven Multithreading Implementations

For the second part of this survey, which will take place after delivering this part, we will implement a couple of applications using TFlux and DDMVM (described below) to get a grip on the advantages and disadvantages of these modern programming models. The following section is based on [2].
Data-Driven Multithreading
This model is based on the Data-flow model, where an instruction is scheduled for execution when all the data it needs is available. The DDM model applies the same principle but at a larger granularity: instead of scheduling a single instruction, it schedules one thread (a group of instructions) when all its input data have been produced and placed in the processor's local memory. Thus, no synchronization or communication latencies are experienced once the execution of the thread starts. DDM decouples the synchronization and computation parts of a program and overlaps them to tolerate synchronization and communication latencies. DDM utilizes data-driven caching policies to implement deterministic data prefetching, which improves the locality of sequential processing. The core of the DDM implementation is the Thread Synchronization Unit (TSU), which is responsible for the scheduling of threads at run-time based on data availability. A DDM program consists of several threads that have producer-consumer relationships and are grouped into DDM Blocks. A DDM block is equivalent to a loop or a function in high-level languages. The TSU schedules a thread to run only after all the producers of this thread have completed execution, which ensures that all the data this thread needs is available. Once the execution of a thread starts, instructions within the thread are fetched by the CPU sequentially in control-flow order, thus exploiting any optimization available in the CPU hardware. Threads are identified by a tuple: the ThreadID, which is static, and the Context, which is dynamic. Each thread is paired with its synchronization template, or meta-data, specifying the following attributes:

- The Instruction Frame Pointer (IFP): points to the address of the first instruction of the thread.
- The ReadyCount (RC): a value equal to the number of producer-threads this thread needs to wait for before starting to execute.
- The Data Frame Pointer List (DFPL): a list of pointers to the data inputs assigned to the thread.
- The Consumer List (CL): a list of this thread's consumers, used to determine which ReadyCount values to decrement after the thread completes its execution.

The synchronization templates of all the threads in the DDM program constitute the data-driven synchronization graph, which is used by the TSU for scheduling threads.
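To make the template concrete, here is a sketch in C of how the per-thread meta-data and the TSU's ready-count bookkeeping described above might be laid out. The field names follow the text, but the layout and helper names are our own assumption, not the actual TFlux or DDM-VM definitions:

/* Sketch: a per-thread synchronization template with the four
 * attributes listed above (illustrative layout, not the real one). */
#include <stddef.h>
#include <stdio.h>

#define NUM_THREADS 16  /* placeholder size */

typedef struct {
    void   *ifp;          /* IFP: address of the thread's first instruction */
    int     ready_count;  /* RC: producers this thread still waits for */
    void  **dfpl;         /* DFPL: pointers to the thread's input data */
    size_t  n_inputs;
    int    *consumers;    /* CL: ThreadIDs whose RC to decrement on completion */
    size_t  n_consumers;
} thread_template;

/* The templates of all threads form the data-driven synchronization
 * graph, indexed here by static ThreadID. */
static thread_template graph[NUM_THREADS];

/* Hand a ready thread to a core for execution (stub). */
static void tsu_schedule(int tid) {
    printf("TSU: thread %d is ready, scheduling\n", tid);
}

/* Called when thread `tid` completes: walk its Consumer List,
 * decrement each consumer's ReadyCount, and schedule any consumer
 * whose producers have now all finished. */
static void tsu_thread_completed(int tid) {
    thread_template *t = &graph[tid];
    for (size_t i = 0; i < t->n_consumers; i++) {
        int c = t->consumers[i];
        if (--graph[c].ready_count == 0)
            tsu_schedule(c);
    }
}

int main(void) {
    /* tiny demo graph: thread 1 consumes the output of thread 0 */
    static int cons_of_0[] = {1};
    graph[0].consumers = cons_of_0;
    graph[0].n_consumers = 1;
    graph[1].ready_count = 1;   /* waits on one producer */
    tsu_thread_completed(0);    /* finishing thread 0 makes thread 1 ready */
    return 0;
}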
4.1 TFlux
TFlux is a platform that supports the DDM model of execution independently of the architecture, by virtualizing any details of the underlying system. It provides runtime support that is built on top of a commodity operating system, and a preprocessor tool along with a set of simple compiler directives that allows the user to easily develop DDM programs. TFlux targeted simulated homogeneous multi-cores with a hardware TSU, and focused on developing a portable software platform that could run on commercial multi-core systems utilizing a software-implemented TSU. The speedup achieved by the TFlux platform was close to linear. Moreover, the speedup was stable across the different targeted platforms, thus allowing the benefits of DDM to be exploited on different commodity systems. The following listing, from [2], shows a small TFlux program:
// Libraries
#include <stdio.h>

// Main Program
int main(int argc, char *argv[])
{
    int x, y, z, k;
    #pragma ddm kernel 2
    #pragma ddm startprogram
    x = 4;
    y = 8;

    // Block 1
    #pragma ddm block 1 import(int x, int y, int z) export(x:4, y:5, z:3)
    /*
     * Import: defines that variable x should be imported from
     * DThread 0. This means that the variable x has been modified
     * not by a DThread of this block but rather by code outside
     * the block.
     * Export: due to the fact that variable x is used by code
     * that follows DThread 1.
     */
    #pragma ddm thread 1 kernel 1 import(x:0) export(int x)
    x++;
    #pragma ddm endthread

    #pragma ddm thread 2 kernel 2 import(y:0) export(int y)
    y++;
    #pragma ddm endthread

    // Defining two DThreads on the same kernel
    // will result in serializing their execution
    #pragma ddm thread 3 kernel 2 import(x:1, y:2) export(int z)
    z = x + y;
    #pragma ddm endthread

    #pragma ddm thread 4 kernel 1 import(z:3, x:1) export(int x)
    x *= z;
    #pragma ddm endthread

    #pragma ddm thread 5 kernel 2 import(z:3, y:2) export(int y)
    y *= z;
    #pragma ddm endthread

    // End of Block 1
    #pragma ddm endblock

    // Print results
    printf("x: %d\ny: %d\nz: %d\n", x, y, z);
    #pragma ddm endprogram
    return 0;
}

4.2 DDMVM

The Data-Driven Multithreading Virtual Machine (DDM-VM) is a virtual machine that supports DDM execution on homogeneous and heterogeneous multi-core systems. DDM-VM virtualizes the parallel resources of the underlying machine and uses a general, unified representation for DDM programs. DDM-VM programs consist of two parts: the code of the DDM threads and the synchronization graph describing the consumer-producer dependencies amongst the threads. The DDM-VM architecture currently has two implementations: the Data-Driven Multithreading Virtual Machine for the Cell (DDM-VMc) and the Data-Driven Multithreading Virtual Machine for Symmetric Multi-cores (DDM-VMs). An overview of the DDM-VMc is given in [2], which focuses on the specification, implementation and evaluation of the DDM-VMs. The following listing, from [2], shows the structure of a DDM-VM program (a blocked matrix multiplication):
int main(int argc, char **argv)
{
    memset(&global_info, 0, sizeof(global_info));
    INIT_DDMSYSTEM(&global_info);
    InitializeData();
    my_id = GetNodeId();
    int short ConsList[3] = {3, 2, 5};
    SystemSync();
    DVM_SET_THREAD_TEMPLATE(THREAD_2,
        1, 1, THREAD_3, (int short *)0, 0, DVM_MODULAR,
        MASK_INDX, DVM_ARITY_1, SM_DEFAULT, 0);
    DVM_SET_THREAD_TEMPLATE(THREAD_2,
        1, 1, THREAD_3, (int short *)0, 0, DVM_MODULAR,
        MASK_INDX, DVM_ARITY_1, SM_PERFECT,
        S_CreateContext(make_mask(DIM*DIM-1)));
    DVM_SET_THREAD_TEMPLATE(THREAD_3,
        1, 3, 0, ConsList, 3, DVM_MODULAR,
        MASK_CNTX, DVM_ARITY_2, SM_DEFAULT, 0);
    DVM_EXECUTE();
    if (GetNodesCount() > 1)
        gather_data(0);
    if (my_id == 0) {
        if (verifyData())
            printf("@Total %f\n", elapsed_time);
        else
            printf("@Incorrect\t%f\n", elapsed_time);
    }
    SHUTDOWN_DDMSYSTEM();
    NIUstop();
    return 1;
}
DVM_CACHEFLOW_DFPL_START();
int indx, cntx;
int rowA, rowB;
int colA, colB;
g_address g;
g.node = GetNodeId();
DVM_START_DFPL(THREAD_3)
    GET_CONTEXT_D(In->Context, cntx, indx);
    rowA = cntx / DIM;
    colA = indx;
    rowB = indx;
    colB = cntx % DIM;
    g.lAddress = (long)A[(rowA*DIM + colA)];
    DVM_SET_DFP(g, BSIZE*BSIZE*sizeof(DATA_TYPE),
        CMD_MODE_READ | CMD_MODE_REUSE | CMD_MODE_WRITE);
DVM_END_DFPL()
DVM_CACHEFLOW_DFPL_END();
int threads_main(unsigned int speid)
{
    INIT_RUNTIME(speid);
    int i, j, k;
    int indx, cntx;
    DVM_SET_IPF(THREAD_2);
    DVM_SET_IPF(THREAD_3);
    DVM_THREAD_END();

    DVM_THREAD_START(THREAD_3);
    GET_CONTEXT_D(DVM_CONTEXT, cntx, indx);
    i = cntx / DIM;
    j = cntx % DIM;
    k = indx;
    DATA_TYPE *pA = (DATA_TYPE *)DVM_LOOKUP(0);
    DATA_TYPE *pB = (DATA_TYPE *)DVM_LOOKUP(1);
    DATA_TYPE *pC = (DATA_TYPE *)DVM_LOOKUP(2);
    //printf("[%d]Thread 3-%d-%d\n", DVM_CORE_ID, cntx, indx);
    matmul(pA, pB, pC);
    //matmul(A[i*DIM+k], B[k*DIM+j], C[i*DIM+j]);
    if (indx < (DIM-1))
        DVM_UPDATE(CONS_1, OP_FIN | OP_INC_INDX, 1);
    else if (cntx < DIM*DIM - DVM_CORE_NUM*SIM_ITER_NUM)
        DVM_UPDATE(CONS_2, OP_FIN | OP_SET_CONTEXT,
            cntx + DVM_CORE_NUM*SIM_ITER_NUM);
    else
        DVM_UPDATE(CONS_NONE, OP_FIN, 0);
    DVM_THREAD_END();
}

Summary

Explicit multithreaded processors interleave the execution of instructions of different user-defined threads within the same pipeline, in contrast to implicit multithreaded processors, which dynamically generate threads from single-threaded programs and execute such speculative threads concurrently with the lead thread. Superscalar and implicit multithreaded processors aim at a low execution time for a single program, while explicit multithreaded processors (and chip multiprocessors) aim at a low execution time for a multithreaded workload.
References

[1] S. Arandi and P. Evripidou. DDM-VMc: the data-driven multithreading virtual machine for the Cell processor. In M. Katevenis, M. Martonosi, C. Kozyrakis, and O. Temam, editors, HiPEAC, pages 25-34. ACM, 2011.
[2] G. Michael. DDM-VMs: Data-driven multithreading virtual machine for symmetric multi-cores. Master's thesis, University of Cyprus, 2010.
[3] K. Stavrou, M. Nikolaides, D. Pavlou, S. Arandi, P. Evripidou, and P. Trancoso. TFlux: A portable platform for data-driven multithreading on commodity multicore systems. In Parallel Processing, 2008. ICPP '08. 37th International Conference on, pages 25-34, Sept. 2008.
[4] T. Ungerer, B. Robic, and J. Silc. A survey of processors with explicit multithreading. ACM Comput. Surv., 35(1):29-63, 2003.
