This type of architecture has a complexity of 16 = very, very large, impossible to test
exhaustively. Systems of this kind can be optimized only by means of genetic algorithms,
which test possible combinations at random.
Pareto efficiency:
A change that makes at least one individual better off without making any other
individual worse off is called a Pareto improvement or a Pareto-optimal move.
An allocation is defined as "Pareto efficient" or "Pareto optimal" when no further Pareto
improvements can be made.
Domination relation: no order can be established between points a and b (see figure) but
both a and b dominate c
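The domination relation can be sketched in a few lines of Python. This is a minimal illustration that assumes every objective is minimized. The coordinates of a, b and c are hypothetical, chosen only to reproduce the situation described above, and the function names are illustrative.

```python
from typing import List, Sequence, Tuple


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p Pareto-dominates q, assuming every objective is minimized."""
    no_worse = all(pi <= qi for pi, qi in zip(p, q))
    strictly_better = any(pi < qi for pi, qi in zip(p, q))
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (f1, f2) coordinates, not taken from the figure.
a = (1.0, 4.0)
b = (3.0, 2.0)
c = (4.0, 5.0)
front = pareto_front([a, b, c])
```

With these coordinates neither a nor b dominates the other, both dominate c, and the front contains a and b only.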
1. generate the first population randomly
2. evaluate all the individuals
3. evaluate if the individuals have broken some rules
4. while the maximum number of evaluations has not been reached do:
a. …
5. STOP
The Pareto principle (also known as the 80–20 rule, the law of the vital few, and
the principle of factor sparsity) states that, for many events, roughly 80% of the
effects come from 20% of the causes.
Fig 2 depicts a Pareto set for a two-objective minimization problem. Potential
solutions that optimize f1 and f2 are shown on the graph.
c. PDF: 1c.pdf
Subject 2.
a) Explain the superior performance (IPC) of the Sv reuse model
compared to the Sn model.
These two schemes are used to implement dynamic reuse. These schemes mainly differ in
the way in which reusable results are identified.
The first scheme (Sv) tracks operand values for each instruction, the second scheme
(Sn) tracks only operand names (register identifiers).
(the third(Sn+d) and the fourth scheme (Sv+d) extend the first two schemes by the use of
dependence relationships among the instructions for tracking reuse).
Using these schemes raises three issues. First, what type of information is stored in the
RB (Reuse Buffer)? Second, how do we know that the values stored in the RB can be
reused? And third, how is the information in the RB updated or invalidated?
B. Format 2 :
C. Format 3 :
TAG – might be represented by the instruction’s PC;
OP1, OP2 – represent the value or name of source registers used by the instruction;
RESULT – represents the actual result of the instruction, which will be reused in the case
of a “hit” in the RB;
MEM_VALID – indicates whether the value in the RESULT field is reusable in the case
of a Load instruction. The bit is set when the Load instruction is written into the RB and
is reset by any Store instruction that has the same access address.
RES_VALID=1 assures the reuse of arithmetic/logic instructions. It also guarantees the
correct address for any Load/Store instruction and exempts the processor from computing
it (indexed addressing, i.e. [Register + Offset]). On the other hand, the result of a Load
instruction can be reused only if MEM_VALID=1 AND RES_VALID=1.
RB entry: The tag field stores part of the PC. The result, operand value1 and
operand value2, store the result and the operand values of the instruction. These fields
are used to identify the instruction (or address calculation in case of a load/store)
that can be reused.
The memvalid bit and the address field are used to determine if the actual
memory access for a load instruction can be reused; the memvalid bit indicates
whether the value loaded from memory (present in the result field) is valid, and
the address field stores the memory address (i.e., the outcome of the address
calculation).
Reuse test: For testing reuse, the operands of an instruction are compared
with the values in the operand value fields of the RB entry.
A match indicates that the result is valid (for non-load/store instructions) or that the
address is valid (for loads and stores). For loads, in addition to testing the validity
of the address bits, we also need to test the memvalid bit to see if the outcome of
the load (in the result field) can be reused. If the operand values are not known at the
time of the reuse test then the instruction is not reused.
Invalidation: For non-load operations, the reuse test works because the
operands uniquely determine the result and therefore invalidations are not needed to
maintain the integrity of the test.
For loads, a store to the same address invalidates the value in the result field.
Accordingly, on a store the address field of each RB entry is searched for a matching
address, and the memvalid bit is reset for matching entries.
Note that the address field, memvalid field, and the associative search for
invalidations are required only to maintain the integrity of load values.
The RB can be split into two buffers: one for storing load values and
another, the main RB, for storing everything except the load values (including
entries for load addresses).
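The RB entry, reuse test and invalidation rule of scheme Sv described above can be sketched as follows. This is a simplified model under stated assumptions: the class and method names are hypothetical, the buffer is an unbounded dictionary indexed by the full PC (a real RB has limited capacity and stores only part of the PC), and for a load whose memvalid bit is reset the sketch reports no reuse at all, although the address alone would still be reusable.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SvEntry:
    op1: int                       # operand value 1
    op2: int                       # operand value 2
    result: int                    # ALU result, or the loaded value for a load
    is_load: bool = False
    address: Optional[int] = None  # outcome of the address calculation
    memvalid: bool = False         # loaded value still matches memory


class ReuseBufferSv:
    """Hypothetical model of scheme Sv: reuse is decided by operand values."""

    def __init__(self) -> None:
        self.entries: Dict[int, SvEntry] = {}  # tag (PC) -> entry

    def insert(self, pc, op1, op2, result, is_load=False, address=None):
        # memvalid is set when a load is written into the RB.
        self.entries[pc] = SvEntry(op1, op2, result, is_load, address, is_load)

    def reuse_test(self, pc, op1, op2) -> Optional[int]:
        """Return the stored result on a hit, or None if it cannot be reused."""
        e = self.entries.get(pc)
        if e is None or (e.op1, e.op2) != (op1, op2):
            return None            # operand values differ: no reuse
        if e.is_load and not e.memvalid:
            return None            # loaded value was overwritten by a store
        return e.result

    def store(self, address: int) -> None:
        # Associative search: reset memvalid for loads from the same address.
        for e in self.entries.values():
            if e.is_load and e.address == address:
                e.memvalid = False
```

Note that no invalidation is needed for non-load entries: the operand comparison alone decides whether the stored result is still correct.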
In scheme Sn, we attempt to trivialize the reuse test (and also to reduce the size of
each RB entry). Rather than store operand values, we store operand (architectural)
register identifiers in the RB.
When an instruction writes into a register, all instructions with a matching
(source) register identifier in the RB are invalidated. Only the valid instructions are
reused from the RB.
The advantage of this reuse test is that it can be done much earlier in the pipeline
than the reuse test in scheme Sv since it does not require the operand values.
Since the reuse test is based on operand names (and not value), we call this
scheme Sn, where ‘n’ stands for name.
Reuse test: The reuse test is as simple as testing the state of resultvalid and
memvalid bits.
Address calculation for load/store instructions and results for all other instructions can be
reused if the resultvalid bit is set; the result of a load instruction can be reused if both
resultvalid and memvalid are set. (Since different instances of the same static instruction
will have the same operand names, we do not need to compare the operand names
explicitly for reuse.)
As mentioned above, since this reuse test does not require operand values, it can be
potentially done earlier in the pipeline; this may result in the reuse being more beneficial.
Invalidations : As before, stores invalidate the loads from the same address
(memvalid bit is reset). Moreover, when a register is written, the RB is searched for
entries whose operand field matches the name of the register. The entries that match are
marked invalid (resultvalid bit is reset).
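A comparable sketch of scheme Sn, again with hypothetical names and an unbounded dictionary standing in for the RB, shows how the reuse test reduces to checking two bits while the work moves into invalidation. It ignores the corner case of an instruction that overwrites one of its own source registers.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SnEntry:
    src_regs: Tuple[str, ...]      # operand register names, not values
    result: int
    resultvalid: bool = True
    is_load: bool = False
    address: Optional[int] = None
    memvalid: bool = False


class ReuseBufferSn:
    """Hypothetical model of scheme Sn: reuse is decided by operand names."""

    def __init__(self) -> None:
        self.entries: Dict[int, SnEntry] = {}  # tag (PC) -> entry

    def insert(self, pc, src_regs, result, is_load=False, address=None):
        self.entries[pc] = SnEntry(tuple(src_regs), result, True,
                                   is_load, address, is_load)

    def reuse_test(self, pc) -> Optional[int]:
        """Only the valid bits are tested; no operand values are needed."""
        e = self.entries.get(pc)
        if e is None or not e.resultvalid:
            return None
        if e.is_load and not e.memvalid:
            return None
        return e.result

    def write_register(self, reg: str) -> None:
        # Any write to a source register invalidates the entry,
        # even if the value written is the same as before.
        for e in self.entries.values():
            if reg in e.src_regs:
                e.resultvalid = False

    def store(self, address: int) -> None:
        for e in self.entries.values():
            if e.is_load and e.address == address:
                e.memvalid = False
```

In this model a register write invalidates an entry even when the value written is unchanged, whereas a value comparison as in Sv would still hit. That conservatism is one way to see why Sv detects more reuse than Sn.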
Supplementary (not in the exam subjects):
Scheme Sn+d: Reuse using register names and dependence chains
when their operand registers are overwritten (resultvalid is reset). Dependent
instructions need not be invalidated on operand overwrites because their reuse status can
be established using their dependence information. Instead, they are invalidated when
their source instructions are evicted from the RB, i.e., when the dependence information
is lost. To perform this operation the RB needs to be searched for entries whose src-index
field matches the index (in the RB) of the source instruction being evicted. The entries
which result in a match are invalidated (resultvalid bit is reset).
Here is a reuse example with Sn+d:
Although the scheme Sv is the most accurate in detecting the reusable instructions
among the three schemes presented so far, it is not very well suited for reusing chains of
dependent instructions in a single cycle. For example, reusing two instructions, I and J,
with J being dependent on I, would require that we first reuse I and then using the reused
result of I we perform the reuse test for J. This whole operation may be difficult to do in a
single cycle, especially for long dependence chains. To facilitate the reuse of dependent
instructions, we augment the scheme Sv with the dependence-tracking ability of scheme
Sn+d, giving us the scheme Sv+d. As in scheme Sn+d, instructions in this scheme are
stored in the RB with pointers to the RB entries containing their source instructions.
RB entry: An RB entry is similar to the one in scheme Sv, except for the addition
of a src-index field. Just like in scheme Sn+d, the dependence links are created by storing
the RB index of the source instructions in this field. An invalid value is inserted in this
field if the source doesn’t exist in the RB.
Reuse test: The reuse status of independent instructions is established as in
scheme Sv : the operand values are compared with the current values of those registers
and the memvalid bit is used to determine the validity of loads. As in scheme Sn+d, a
dependent instruction is reused by confirming that its source instructions (in the RB), as
indicated by the src-index field of its operands, are indeed the latest producers for its
operands. This fact is established with the help of the RST.
State updates: As in other schemes, stores invalidate the loads to the same
address (memvalid is reset). As in scheme Sn+d, the state of dependent instructions is
updated when their source instructions are evicted from the RB, i.e., when their
dependence information is lost. The state can be updated in two ways: either (i) the
dependent instructions can be marked invalid, or (ii) their src-index fields, pointing to the
evicted source, are annulled (and thereafter, they are treated like independent instructions
— i.e., their validity is determined by value comparison). The first option is simple but
conservative since it invalidates potentially useful instructions. The second option, on the
other hand, retains the dependent instructions, but it requires additional space in RB
entries since the operand values need to be stored for the dependent instructions as well (so
that value comparison can be performed if the dependent instructions become
independent). Nevertheless, both update operations require that the RB be searched for
the entries whose src-index field matches the RB index of the source instruction being
evicted. These matching entries are either invalidated or converted into independent
entries.
A contextual predictor predicts the next value
based on a particular stored pattern (context) that is repetitively generated in
the value sequence, in a markovian stochastic manner. Theoretically they
can predict any repetitive value sequences. A context predictor is of order k
if its context information includes the last k values, and, therefore, the
search is done using this pattern of k values length. As we already pointed
out, a contextual predictor of order k derives from the k-value locality metric
that represents an idealised k-context predictor.
In this case the prediction will be done based on the most frequent value that
follows a pattern context in the string of history values.
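A minimal sketch of an order-k contextual predictor, assuming an unbounded table and hypothetical class and method names: for each context of k values it counts which value followed, and it predicts the most frequent follower of the current context.

```python
from collections import Counter, defaultdict, deque


class ContextPredictor:
    """Order-k context predictor (illustrative, unbounded storage)."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.history = deque(maxlen=k)      # the last k values seen
        self.table = defaultdict(Counter)   # context -> follower counts

    def predict(self):
        """Most frequent value that followed the current context, or None."""
        if len(self.history) < self.k:
            return None                     # not enough history yet
        followers = self.table.get(tuple(self.history))
        if not followers:
            return None                     # context never seen before
        return followers.most_common(1)[0][0]

    def update(self, value) -> None:
        """Record that `value` followed the current context, then shift it in."""
        if len(self.history) == self.k:
            self.table[tuple(self.history)][value] += 1
        self.history.append(value)


predictor = ContextPredictor(k=2)
for v in [1, 2, 3, 1, 2, 3, 1, 2]:
    predictor.update(v)
next_value = predictor.predict()
```

After this training the current context is (1, 2), which has always been followed by 3, so the predictor returns 3.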
THE REST, FILLER…
- Measurements using SPEC benchmarks show that value locality on Load instructions
is about 50% with a history of one (the load produces the same value as its previous
instance) and about 80% with a history of the 16 previous instances.
- The concept is closely related to redundant-computation concepts (such as the
memoization technique), including the Dynamic Instruction Reuse technique introduced
earlier.
Value Locality -> Value Predictability
However, value locality and value predictability are not the same concept. A sequence can
have 100% locality and still be very unpredictable (as a simple example, a random sequence
of 0s and 1s has 100% locality with a history of two values but can be very unpredictable).
More generally, this happens whenever the value sequence is not a Markov process.
Not every predictable sequence of values derives from the concept of value locality:
E.g. i++: 1, 2, 3, 4, 5, 6, 7, ?
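The difference between locality and predictability can be checked numerically. The sketch below assumes one particular definition, namely that a value "hits" when it equals one of the last `depth` distinct values seen; both function names are hypothetical. It measures locality for a 0/1 sequence and for the counting sequence, and counts how often a simple "last value plus last stride" guess is right.

```python
def value_locality(values, depth):
    """Fraction of values equal to one of the last `depth` distinct values."""
    history = []                 # most recent distinct values, newest last
    hits = 0
    for v in values:
        if v in history:
            hits += 1
            history.remove(v)    # re-appended below as the newest value
        history.append(v)
        if len(history) > depth:
            history.pop(0)       # forget the oldest value
    return hits / len(values)


def stride_hits(values):
    """Count values equal to the previous value plus the previous stride."""
    hits = 0
    for i in range(2, len(values)):
        if values[i] == values[i - 1] + (values[i - 1] - values[i - 2]):
            hits += 1
    return hits


binary = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]   # arbitrary 0/1 sequence
counting = list(range(1, 8))              # 1, 2, ..., 7
```

With a depth of two, every value of the binary sequence after the first two is a hit, yet nothing in the locality figure says which of the two values comes next. The counting sequence has zero locality at any depth, while the stride guess is correct for every element from the third onward.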
Why Value Locality?
Data redundancy – the input data sets of general-purpose programs are redundant
(sparse matrices, text files with many blanks and repeated characters, empty cells
in spreadsheets, etc.).
Exploiting compiler error tables when repetitive errors are generated.
Program constants – it is often more efficient to load program constants from
memory than to construct them as immediate operands.
Case/switch constructs – they require the repeated load of a constant (the branch's
base address).
Virtual function calls – these load a function pointer that is constant at run time.
The implementation of polymorphism in object-oriented programming is similar.
Computed branches – to calculate a branch address it is necessary to load a register
with the base address of the branch jump table, which might be a run-time constant.
Register spill code – when all the CPU registers are busy, variables that may remain
constant are spilled to data memory and loaded repeatedly.
Polling algorithms – the most likely outcome is that the event being interrogated for has
not yet occurred, involving the redundant computation to repeatedly check for the event,
etc.
Requirements:
- Prediction and speculation need dedicated mechanisms for:
- Detecting mispredicted values and checking the prediction's accuracy.
- Recovering the processor's context after a misprediction (ROB).
- Issuing dependent instructions speculatively (involving the standard out-of-order logic
with some minor modifications).
- Storing and bypassing predicted values to the next dependent instructions to be processed.
This speculative mechanism is VP's main advantage.
Two-Level Adaptive Branch Prediction uses
two levels of branch history information to make a branch prediction. The
first level consists of a History Register (HR) that records the outcome of
the last k branches encountered. The HR may be a single global register,
HRg, that records the outcome of last k branches executed in the dynamic
instruction stream or one of multiple local history registers, HRl, that record
the last k outcomes of each branch. The second level of the predictor,
known as the Pattern History Table (PHT), records the behaviour of a branch
during previous occurrences of the first-level history pattern. It consists of an
array of two-bit saturating counters, one for each possible value of the HR.
2^k entries are therefore required if a global PHT is provided, or many times
this number if a separate HR, and therefore PHT, is provided for each branch.
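The global variant (one HRg and one PHT) can be sketched as below. The class name, the initial counter value and the all-zero initial history are assumptions made for illustration, and the sketch ignores the branch address entirely.

```python
class TwoLevelGlobalPredictor:
    """Global history register plus a PHT of 2-bit saturating counters."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.hr = 0                   # last k outcomes, newest in bit 0
        self.pht = [1] * (1 << k)     # 2^k counters, start weakly not-taken

    def predict(self) -> bool:
        """Predict taken when the counter selected by the HR is 2 or 3."""
        return self.pht[self.hr] >= 2

    def update(self, taken: bool) -> None:
        counter = self.pht[self.hr]
        # Saturating increment on taken, saturating decrement on not taken.
        self.pht[self.hr] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the outcome into the history register, keeping k bits.
        self.hr = ((self.hr << 1) | int(taken)) & ((1 << self.k) - 1)


def count_mispredictions(predictor, outcomes) -> int:
    """Predict each outcome, then train on it; return the number of misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses
```

For a repeating taken, taken, not-taken pattern a history of k = 2 is enough: each two-outcome history is always followed by the same outcome, so after a short warm-up the counters saturate and the pattern is predicted without error.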
Subject 4:
The branch prediction problem consists of two sub-problems:
firstly generating the correct prediction and secondly in the case of a
taken branch predicting the correct branch target.
Both researchers (Jimenez and Vintan) conclude that neural predictors
exploit greater correlations than two-level predictors and can achieve
greater prediction accuracy. Jimenez showed that his predictor achieved
a misprediction rate of 1.71%, which equates to 36% fewer mispredictions
than a McFarling-style hybrid two-level predictor [18].
Vintan showed that his predictor achieved a misprediction rate of
about 11%, which equates to a 3% improvement in the misprediction
rate for his neural predictor over a conventional two-level predictor.
The hardware cost of neural branch predictors grows linearly with the
history length, whereas that of two-level adaptive branch predictors
grows exponentially.
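The linear growth can be seen in a sketch of a single perceptron: it keeps one weight per history bit plus a bias, where a pattern table would need one counter per history pattern. The code below is an illustration, not either author's implementation. It models one perceptron only (a real predictor would select one from a table using the branch address), leaves the weights unbounded, and uses the training threshold floor(1.93·h + 14) commonly quoted for perceptron predictors as an assumed default.

```python
class PerceptronPredictor:
    """Single perceptron over the last h branch outcomes (illustrative)."""

    def __init__(self, history_length: int, theta=None) -> None:
        self.weights = [0] * (history_length + 1)   # weights[0] is the bias
        self.history = [-1] * history_length        # +1 taken, -1 not taken
        # Training threshold; the default formula is an assumption.
        self.theta = theta if theta is not None else int(1.93 * history_length + 14)

    def output(self) -> int:
        """Bias plus the dot product of the weights and the history."""
        return self.weights[0] + sum(
            w * x for w, x in zip(self.weights[1:], self.history))

    def predict(self) -> bool:
        return self.output() >= 0

    def update(self, taken: bool) -> None:
        y = self.output()
        t = 1 if taken else -1
        # Learn on a misprediction or when the output is not confident enough.
        if (y >= 0) != taken or abs(y) <= self.theta:
            self.weights[0] += t
            for i, x in enumerate(self.history):
                self.weights[i + 1] += t * x        # reinforce agreeing bits
        self.history = self.history[1:] + [t]
```

Each weight ends up measuring how strongly one history bit correlates with the branch outcome, which is also the short answer to how a perceptron learns: weights are nudged toward agreement with the observed outcome whenever the prediction was wrong or weak.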
Subject 5
1. The cache coherence problem in multiprocessor systems
To understand these advantages, we first need to explain the characteristics of the
MSI protocol.
It is essentially an invalidation protocol for write-back caches.
The MESI protocol adds a fourth state, Exclusive, to reduce the traffic caused by
writing a block that exists in only one of the caches (when the data in one cache is
modified and the other copies become inconsistent).
This state is similar to Shared in that it can hold a copy of the most recent
(correct) data.
It resembles Modified in that the copy in main memory may be incorrect.
Under the MOESI protocol, a cache block is classified according to the following
characteristics:
- Validity
- Exclusiveness
- Ownership
This approach avoids writing modified data back to memory before it is
passed on to the other caches.
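The classification by validity, exclusiveness and ownership can be written down as a small lookup table. The attribute assignment below follows the usual textbook description of MOESI and should be read as an assumption rather than as something stated in these notes; the names `BlockAttrs`, `MOESI` and `must_supply_data` are hypothetical.

```python
from typing import NamedTuple


class BlockAttrs(NamedTuple):
    valid: bool       # the cached copy holds usable data
    exclusive: bool   # no other cache holds a copy
    owned: bool       # this cache, not memory, is responsible for the data


# Assumed mapping of the five states onto the three characteristics.
MOESI = {
    "Modified":  BlockAttrs(valid=True,  exclusive=True,  owned=True),
    "Owned":     BlockAttrs(valid=True,  exclusive=False, owned=True),
    "Exclusive": BlockAttrs(valid=True,  exclusive=True,  owned=False),
    "Shared":    BlockAttrs(valid=True,  exclusive=False, owned=False),
    "Invalid":   BlockAttrs(valid=False, exclusive=False, owned=False),
}


def must_supply_data(state: str) -> bool:
    """An owning cache answers requests, since memory may be stale."""
    return MOESI[state].owned


def can_write_silently(state: str) -> bool:
    """A valid exclusive copy can be written without notifying other caches."""
    attrs = MOESI[state]
    return attrs.valid and attrs.exclusive
```

Read this way, Owned is the one state that is shared yet still responsible for the data, which is what allows modified data to be passed between caches without first being written to memory.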
Atomic operations use atomic locks, which lock several variables at the same time.
If not all of them can be locked, then none of them is locked.
This rules out the possibility of a deadlock, which could otherwise occur, for example,
when one thread locks the first variable and a second thread locks the second
variable. In that case neither of the two threads would complete.
Disadvantages:
- They cause blocking, in the sense that some threads are forced to wait
until a lock is released;
- They increase the complexity of programs;
- Priority: higher-priority threads cannot run if lower-priority threads
hold locks on resources that they need;
- Debugging is difficult because the bugs are timing-dependent.
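The all-or-nothing locking rule can be sketched with Python's standard `threading.Lock`, whose `acquire(blocking=False)` call returns immediately instead of waiting. The helper names `acquire_all` and `release_all` are hypothetical. The helper either ends up holding every lock or holds none of them.

```python
import threading


def acquire_all(locks) -> bool:
    """Try to take every lock; on any failure release those already taken."""
    taken = []
    for lock in locks:
        if lock.acquire(blocking=False):    # do not wait for a busy lock
            taken.append(lock)
        else:
            for held in reversed(taken):    # back out: hold none of them
                held.release()
            return False
    return True


def release_all(locks) -> None:
    """Release the locks in the reverse order of acquisition."""
    for lock in reversed(locks):
        lock.release()


lock_a = threading.Lock()
lock_b = threading.Lock()
```

Because a thread never waits while holding a lock, the circular wait needed for a deadlock cannot form. The caller has to retry after a failed attempt, so repeated collisions between threads remain possible in this simple form.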
If an operation requires multiple CPU instructions, then it may be interrupted in the middle of
executing. If this results in a context switch (or if the interrupt handler refers to data that was
being used) then atomicity could be compromised. It is possible to use any standard locking
technique (e.g. a spinlock) to prevent this, but this may be inefficient. Where possible, disabling
interrupts may be the most efficient method of ensuring atomicity (although note that this may
increase the worst-case interrupt latency, which could be problematic if it becomes too long).
- MPI_Reduce: a single process (the root process) collects the data from
the other processes in a group and combines them, by means of an operation,
into a single value
- MPI_Barrier: blocks the processes of the communicator comm as each of them
calls the function, until all the processes of the communicator
have called it. On return from the function all the processes are
synchronized.
i. 1.) Finite-state-machine automata
1. There is no mathematical proof of convergence; it is only an
empirical algorithm
2. 2.) Exponentially rising complexity (to exploit n bits of
history, a 2^n-entry history table is needed)
ii. Proposal: perceptron branch predictor
1. Perceptron based learning/prediction algorithm
2. Perceptron is also feasible to be implemented in hardware
3. Dynamic instruction reuse
a. Why is there a significant degree of instruction reuse in programs? (e.g.
polymorphism)
b. How could this be efficiently exploited in a superscalar processor? (Sodani
& Sohi -> reuse buffer structure)
c. Ideas about extending Dynamic Instruction Reuse to function reuse at high
level language (= Memoization)
Memoization saves a function's result in a table. If the function is called again
with the same parameters, then its result is reused from the table instead of
being re-evaluated.
Memoization is also used to reduce the running time of some
optimizing compilers, where the same data dependence test is carried out
repeatedly.
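A minimal sketch of memoization as a Python decorator: the table is keyed by the argument tuple, so the approach assumes hashable arguments and a function without side effects. `memoize`, `slow_square` and `call_log` are hypothetical names used only for this illustration.

```python
def memoize(func):
    """Wrap func so that results are reused for repeated argument tuples."""
    table = {}

    def wrapper(*args):
        if args not in table:          # miss: evaluate once and remember
            table[args] = func(*args)
        return table[args]             # hit: reuse the stored result

    wrapper.table = table              # exposed only for inspection
    return wrapper


call_log = []                          # records every real evaluation


@memoize
def slow_square(x):
    call_log.append(x)
    return x * x


first = slow_square(4)
second = slow_square(4)                # served from the table
```

The second call returns the stored result and the function body runs only once. This is the software analogue of a hit in the reuse buffer.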
c. Hybrid prediction (multiple value predictors working together -> Meta-Predictor)
5. Advanced Prediction Methods in computer science
a. Simple Markovian Predictors
For example, a Bayesian network could represent the probabilistic relationships between
diseases and symptoms. Given symptoms, the network can be used to compute the probabilities
of the presence of various diseases.
d. Neural predictors
Learning Vector Quantisation Network (LVQ, T. Kohonen) and a Multi-Layer
Perceptron (MLP)
f. Meta-Prediction (again, see 4.)
6. X
a. Multi-Core and many-core architectures
i. Fundamental concepts: programming models
1. (shared address/ memory vs. message passing)
ii. Cache coherence problem
iii. Critical section concept
In concurrent programming, a critical section is a piece of code that accesses a shared resource
(data structure or device) that must not be concurrently accessed by more than one thread of
execution.[1] A critical section will usually terminate in fixed time, and a thread, task, or process
will have to wait for a fixed time to enter it (aka bounded waiting).
Some synchronization mechanism is required at the entry and exit of the critical section to ensure
exclusive use, for example a semaphore.
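A critical section guarded by a semaphore can be sketched with Python's standard `threading` module. The shared counter, the function names and the thread counts are illustrative assumptions. The semaphore is initialised to 1, so at most one thread is inside the critical section at a time.

```python
import threading

shared = {"counter": 0}                # the shared resource
semaphore = threading.Semaphore(1)     # at most one thread inside


def worker(iterations: int) -> None:
    for _ in range(iterations):
        with semaphore:                # entry: acquire the semaphore
            shared["counter"] += 1     # critical section
        # exit: the semaphore is released when the with-block ends


def run(num_threads: int = 4, iterations: int = 1000) -> int:
    """Start the workers, wait for them, and return the final counter."""
    shared["counter"] = 0
    threads = [threading.Thread(target=worker, args=(iterations,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"]
```

With the semaphore in place the final count always equals the number of threads times the number of iterations. Without it, the read-modify-write on the counter could interleave between threads and lose updates.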
General scope:
● Only important concepts, no details! For example:
○ How does a perceptron learn?
○ What is an unbiased branch?
○ Why can problems happen in object oriented programming? (e.g. C++)