contributed articles

DOI: 10.1145/1610252.1610271

Easing the programmer's burden does not compromise system performance or increase the complexity of hardware implementation.

BY JOSEP TORRELLAS, LUIS CEZE, JAMES TUCK, CALIN CASCAVAL, PABLO MONTESINOS, WONSUN AHN, AND MILOS PRVULOVIC

The Bulk Multicore Architecture for Improved Programmability

MULTICORE CHIPS AS commodity architecture for platforms ranging from handhelds to supercomputers herald an era when parallel programming and computing will be the norm. While the computer science and engineering community has periodically focused on advancing the technology for parallel processing,8 this time around the stakes are truly high, since there is no obvious route to higher performance other than through parallelism. However, for parallel computing to become widespread, breakthroughs are needed in all layers of the computing stack, including languages, programming models, compilation and runtime software, programming and debugging tools, and hardware architectures.

At the hardware-architecture layer, we need to change the way multicore architectures are designed. In the past, architectures were designed primarily for performance or for energy efficiency. Looking ahead, one of the top priorities must be for the architecture to enable a programmable environment. In practice, programmability is a notoriously difficult metric to define and measure. At the hardware-architecture level, programmability implies two things: first, the architecture is able to attain high efficiency while relieving the programmer from having to manage low-level tasks; second, the architecture helps minimize the chance of (parallel) programming errors.

In this article, we describe a novel, general-purpose multicore architecture—the Bulk Multicore—we designed to enable a highly programmable environment. In it, the programmer and runtime system are relieved of having to manage the sharing of data, thanks to novel support for scalable hardware cache coherence. Moreover, to help minimize the chance of parallel-programming errors, the Bulk Multicore provides high-performance sequential memory consistency to the software and also introduces several novel hardware primitives. These primitives can be used to build a sophisticated program-development-and-debugging environment, including low-overhead data-race detection, deterministic replay of parallel programs, and high-speed disambiguation of sets of addresses. The primitives have an overhead low enough to always be "on" during production runs.

The key idea in the Bulk Multicore is twofold: First, the hardware automatically executes all software as a series of atomic blocks of thousands of dynamic instructions called Chunks. Chunk execution is invisible to the software and, therefore, puts no restriction on the programming language or model. Second, the Bulk Multicore introduces the use of Hardware Address Signatures as a low-overhead mechanism to ensure atomic and isolated execution of chunks and to help maintain hardware cache coherence.
The programmability advantages of the Bulk Multicore do not come at the expense of performance. On the contrary, the Bulk Multicore enables high performance because the processor hardware is free to aggressively reorder and overlap the memory accesses of a program within chunks without risk of breaking their expected behavior in a multiprocessor environment. Moreover, in an advanced Bulk Multicore design where the compiler observes the chunks, the compiler can further improve performance by heavily optimizing the instructions within each chunk. Finally, the Bulk Multicore organization decreases hardware design complexity by freeing processor designers from having to worry about many corner cases that appear when designing multiprocessors.

Architecture
The Bulk Multicore architecture eliminates one of the traditional tenets of processor architecture, namely the need to commit instructions in order, providing the architectural state of the processor after every single instruction. Having to provide such state in a multiprocessor environment—even if no other processor or unit in the machine needs it—contributes to the complexity of current system designs. This is because, in such an environment, memory-system accesses take many cycles, and multiple loads and stores from both the same and different processors overlap their execution.

In the Bulk Multicore, the default execution mode of a processor is to commit chunks of instructions at a time.2 A chunk is a group of dynamically contiguous instructions (such as 2,000 instructions). Such a "chunked" mode of execution and commit is a hardware-only mechanism, invisible to the software running on the processor. Moreover, its purpose is not to parallelize a thread, since the chunks in a thread are not distributed to other processors. Rather, the purpose is to improve programmability and performance.
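To make the chunked execution model concrete, the following C++ sketch models, in software and purely for illustration, how a chunking unit might carve one thread's dynamic instruction stream into fixed-size chunks. The Instruction type and the 2,000-instruction kChunkSize are illustrative placeholders; the real mechanism is hardware-only and invisible to software.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Instruction { /* opcode, operands, ... */ };

constexpr std::size_t kChunkSize = 2000;  // e.g., 2,000 dynamic instructions

// Partition one thread's dynamic instruction stream into chunks. Chunks
// stay on one core (they are not a parallelization mechanism); each chunk
// later executes atomically and in isolation.
std::vector<std::vector<Instruction>>
formChunks(const std::vector<Instruction>& dynamicStream) {
  std::vector<std::vector<Instruction>> chunks;
  for (std::size_t i = 0; i < dynamicStream.size(); i += kChunkSize) {
    const std::size_t end = std::min(i + kChunkSize, dynamicStream.size());
    chunks.emplace_back(dynamicStream.begin() + i, dynamicStream.begin() + end);
  }
  return chunks;
}
```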


Each chunk executes on the processor atomically and in isolation. Atomic execution means that none of the chunk's actions are made visible to the rest of the system (processors or main memory) until the chunk completes and commits. Execution in isolation means that if the chunk reads a location and (before it commits) a second chunk in another processor that has written to the location commits, then the local chunk is squashed and must re-execute.

To execute chunks atomically and in isolation inexpensively, the Bulk Multicore introduces hardware address signatures.3 A signature is a register of ≈1,024 bits that accumulates hash-encoded addresses. Figure 1 outlines a simple way to generate a signature (see the sidebar "Signatures and Signature Operations in Hardware" for a deeper discussion). A signature, therefore, represents a set of addresses.

In the Bulk Multicore, the hardware automatically accumulates the addresses read and written by a chunk into a read (R) and a write (W) signature, respectively. These signatures are kept in a module in the cache hierarchy. This module also includes simple functional units that operate on signatures, performing such operations as signature intersection (to find the addresses common to two signatures) and address membership test (to find out whether an address belongs to a signature), as detailed in the sidebar.

Atomic chunk execution is supported by buffering the state generated by the chunk in the L1 cache. No update is propagated outside the cache while the chunk is executing. When the chunk completes, or when a dirty cache line with an address in the W signature must be displaced from the cache, the hardware proceeds to commit the chunk. A successful commit involves sending the chunk's W signature to the subset of sharer processors indicated by the directory2 and clearing the local R and W signatures. The latter operation erases any record of the updates made by the chunk, though the written lines remain dirty in the cache.

The W signature carries enough information to both invalidate stale lines from the other coherent caches (using the δ signature operation on W, as discussed in the sidebar) and enforce that all other processors execute their chunks in isolation. Specifically, to enforce that a processor executes a chunk in isolation, when the processor receives an incoming signature Winc, its hardware intersects Winc against the local Rloc and Wloc signatures. If either of the two intersections is not null, it means (conservatively) that the local chunk has accessed a data element written by the committing chunk. Consequently, the local chunk is squashed and then restarted.

Sidebar: Signatures and Signature Operations in Hardware
Figure 1 in the main text shows a simple implementation of a signature. The bits of an incoming address go through a fixed permutation to reduce collisions and are then separated into bit-fields Ci. Each field is decoded and accumulated into a bit-field Vj in the signature. Much more sophisticated implementations are also possible.

A module called the Bulk Disambiguation Module contains several signature registers and simple functional units that operate efficiently on signatures. These functional units are invisible to the instruction-set architecture. Note that, given a signature, we can recover only a superset of the addresses originally encoded into the signature. Consequently, the operations on signatures produce conservative results.

The figure here outlines five signature functional units: intersection, union, test for null signature, test for address membership, and decoding (δ). Intersection finds the addresses common to two signatures by performing a bit-wise AND of the two signatures. The resulting signature is empty if, as shown in the figure, any of its bit-fields contains all zeros. Union finds all addresses present in at least one signature through a bit-wise OR of the two signatures. Testing whether an address a is present (conservatively) in a signature involves encoding a into a signature, intersecting the latter with the original signature, and then testing the result for a null signature.

Decoding (δ) a signature determines which cache sets can contain addresses belonging to the signature. The set bitmask produced by this operation is then passed to a finite-state machine that successively reads individual lines from the sets in the bitmask and checks them for membership in the signature. This process is used to identify and invalidate all the addresses in a signature that are present in the cache.

Overall, the support described here enables low-overhead operations on sets of addresses.3

[Sidebar figure: Operations on signatures.]
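The sidebar's encoding scheme can be mirrored in a few lines of C++. The sketch below is a software model under assumed parameters—four 8-bit fields, each decoded into a 256-bit group, with a xor-multiply mix standing in for the hardware's fixed wire permutation—so it shows the flavor of the operations, not the actual hardware design.

```cpp
#include <bitset>
#include <cstdint>

// Software model of a ~1,024-bit hardware address signature, following the
// Figure 1 scheme: permute the address, split it into bit-fields C_j, and
// decode each field one-hot into a region V_j of the signature register.
class Signature {
 public:
  void insert(std::uint64_t addr) {
    const std::uint64_t p = permute(addr);
    for (int j = 0; j < kFields; ++j) {
      const unsigned field = (p >> (8 * j)) & 0xFF;  // 8-bit field C_j
      bits_.set(j * 256 + field);                    // one-hot into V_j
    }
  }

  // Intersection is a bit-wise AND; the result is conservative (it may
  // represent a superset of the true common addresses).
  Signature operator&(const Signature& o) const {
    Signature r; r.bits_ = bits_ & o.bits_; return r;
  }
  // Union is a bit-wise OR.
  Signature operator|(const Signature& o) const {
    Signature r; r.bits_ = bits_ | o.bits_; return r;
  }
  // A signature is null if any field's bit-group is all zeros.
  bool empty() const {
    for (int j = 0; j < kFields; ++j) {
      bool anySet = false;
      for (int b = 0; b < 256; ++b) anySet |= bits_[j * 256 + b];
      if (!anySet) return true;
    }
    return false;
  }
  // Membership test: encode the address, intersect, test for null.
  // "true" may be a false positive; "false" is always correct.
  bool mayContain(std::uint64_t addr) const {
    Signature s; s.insert(addr);
    return !(s & *this).empty();
  }
  void clear() { bits_.reset(); }

 private:
  static constexpr int kFields = 4;  // C_0..C_3 -> V_0..V_3, 256 bits each
  std::bitset<1024> bits_;

  // Fixed permutation to reduce collisions; an invertible xor-multiply mix
  // stands in for the hardware's wire permutation.
  static std::uint64_t permute(std::uint64_t a) {
    a ^= a >> 17; a *= 0x9E3779B97F4A7C15ULL; a ^= a >> 29;
    return a;
  }
};
```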


[Figure 1. A simple way to generate a signature.]

[Figure 2. Executing chunks atomically and in isolation with signatures.]

Figure 2 outlines atomic and isolated execution. Thread 0 executes a chunk that writes variables B and C, and no invalidations are sent out. Signature W0 receives the hashed addresses of B and C. At the same time, Thread 1 issues reads for B and C, which (by construction) load the non-speculative values of the variables—namely, the values before Thread 0's updates. When Thread 0's chunk commits, the hardware sends signature W0 to Thread 1, and W0 and R0 are cleared. At the processor where Thread 1 runs, the hardware intersects W0 with the ongoing chunk's R1 and W1. Since W0 ∩ R1 is not null, the chunk in Thread 1 is squashed.

The commit of chunks is serialized globally. In a bus-based machine, serialization is given by the order in which W signatures are placed on the bus. With a general interconnect, serialization is enforced by a (potentially distributed) arbiter module.2 W signatures are sent to the arbiter, which quickly acknowledges whether the chunk can be considered committed.
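Putting the pieces together, the squash test a processor performs when a committed chunk's W signature arrives can be sketched as follows. In this self-contained sketch, exact address sets stand in for the conservative signatures of the previous sketch, and the addresses are made up to mirror the Figure 2 example.

```cpp
#include <cstdint>
#include <set>

using AddrSet = std::set<std::uint64_t>;

struct ChunkState {
  AddrSet R, W;  // addresses read / written by the ongoing local chunk
};

// Does the committing chunk's W signature overlap a local signature?
bool overlaps(const AddrSet& a, const AddrSet& b) {
  for (auto x : a) if (b.count(x)) return true;
  return false;
}

// Called when another processor's chunk commits and its W signature
// (Winc) arrives. Returns true if the local chunk must be squashed
// and re-executed.
bool onIncomingCommit(const AddrSet& Winc, const ChunkState& local) {
  return overlaps(Winc, local.R) || overlaps(Winc, local.W);
}

int main() {
  const std::uint64_t B = 0xB0, C = 0xC0;  // illustrative addresses
  ChunkState thread1;
  thread1.R = {B, C};       // Thread 1's ongoing chunk read B and C

  AddrSet W0 = {B, C};      // Thread 0's committing chunk wrote B and C
  // W0 ∩ R1 is not null, so Thread 1's chunk is squashed, as in Figure 2.
  return onIncomingCommit(W0, thread1) ? 0 : 1;
}
```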
Since chunks execute atomically and in isolation, commit in program order in each processor, and follow a global commit order, the Bulk Multicore supports sequential consistency (SC)9 at the chunk level. As a consequence, the machine also supports SC at the instruction level. More important, it supports high-performance SC at low hardware complexity.

The performance of this SC implementation is high because (within a chunk) the Bulk Multicore allows memory access reordering and overlap and instruction optimization. As we discuss later, synchronization instructions induce no reordering constraint within a chunk.

Meanwhile, hardware-implementation complexity is low because memory-consistency enforcement is largely decoupled from processor structures. In a conventional processor that issues memory accesses out of order, supporting SC requires intrusive processor modifications. For example, from the time the processor executes a load to line L out of order until the load reaches its commit time, the hardware must check for writes to L by other processors—in case an inconsistent state was observed. Such checking typically requires sending, for each external coherence event, a signal up the cache hierarchy. The signal snoops the load queue to check for an address match. Additional modifications involve preventing cache displacements that could risk missing a coherence event. Consequently, load queues, L1 caches, and other critical processor components must be augmented with extra hardware.

In the Bulk Multicore, SC enforcement and violation detection are performed with simple signature intersections outside the processor core. Additionally, caches are oblivious to which data is speculative, and their tag and data arrays are unmodified.

Finally, note that the Bulk Multicore's execution mode is not like transactional memory.6 While one could intuitively view the Bulk Multicore as an environment with transactions occurring all the time, the key difference is that chunks are dynamic entities, rather than static, and invisible to the software.

High Programmability
Since chunked execution is invisible to the software, it places no restriction on programming model, language, or runtime system. However, it does enable a highly programmable environment by virtue of providing two features: high-performance SC at the hardware level and several novel hardware primitives that can be used to build a sophisticated program-development-and-debugging environment.

Unlike current architectures, the Bulk Multicore supports high-performance SC at the hardware level. If we generate code for the Bulk Multicore using an SC compiler (such as the BulkCompiler1), we attain a high-performance, fully SC platform. The resulting platform is highly programmable for several reasons. The first is that debugging concurrent programs with data races would be much easier. This is because the possible outcomes of the memory accesses involved in the bug would be easier to reason about, and the debugger would in fact be able to reproduce the buggy interleaving.


Second, most existing software correctness tools (such as Microsoft's CHESS14) assume SC. Verifying software correctness under SC is already difficult, and the state space balloons if non-SC interleavings need to be verified as well. In the next few years, we expect that correctness-verification tools will play a larger role as more parallel software is developed. Using them in combination with an SC platform would make them most effective.

A final reason for the programmability of an SC platform is that it would make the memory model of safe languages (such as Java) easier to understand and verify. The need to provide safety guarantees and enable performance at the same time has resulted in an increasingly complex and unintuitive memory model over the years. A high-performance SC memory model would trivially ensure Java's safety properties related to memory ordering, improving its security and usability.

The Bulk Multicore's second feature is a set of hardware primitives that can be used to engineer a sophisticated program-development-and-debugging environment that is always "on," even during production runs. The key insight is that chunks and signatures free development and debugging tools from having to record or be concerned with individual loads and stores. As a result, the amount of bookkeeping and state required by the tools is substantially reduced, as is the time overhead. Here, we give three examples of this benefit in the areas of deterministic replay of parallel programs, data-race detection, and high-speed disambiguation of sets of addresses.

Note, too, that chunks provide an excellent primitive for supporting popular atomic-section-based techniques for programmability (such as thread-level speculation17 and transactional memory6).

Deterministic replay of parallel programs with practically no log. Hardware-assisted deterministic replay of parallel programs is a promising technique for debugging parallel programs. It involves a two-step process.20 In the recording step, while the parallel program executes, special hardware records into a log the order of data dependences observed among the multiple threads. The log effectively captures the "interleaving" of the program's threads. Then, in the replay step, while the parallel program is re-executed, the system enforces the interleaving orders encoded in the log.

In most proposed deterministic-replay schemes, the log stores individual data dependences between threads, or groups of dependences bundled together. In the Bulk Multicore, the log must store only the total order of chunk commits, an approach we call DeLorean.13 The logged information can be as minimalist as a list of committing-processor IDs, assuming the chunking is performed in a deterministic manner; the chunk sizes can then be deterministically reproduced on replay. This design, which we call OrderOnly, reduces the log size by nearly an order of magnitude over previous proposals.

The Bulk Multicore can further reduce the log size if, during the recording step, the arbiter enforces a certain order of chunk-commit interleaving among the different threads (such as by committing one chunk from each processor round robin). In this case of enforced chunk-commit order, the log practically disappears. During the replay step, the arbiter enforces the same commit algorithm, forcing the same order of chunk commits as in the recording step. This design, which we call PicoLog, typically incurs a performance cost because it can force some processors to wait during recording.

Figure 3a outlines a parallel execution in which the boxes are chunks and the arrows are the observed cross-thread data dependences. Figure 3b shows a possible resulting execution log in OrderOnly, while Figure 3c shows the log in PicoLog.
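The following sketch models the two log designs under the stated assumption of deterministic chunking: OrderOnly's log is just the sequence of committing-processor IDs, while PicoLog replaces the log with a fixed commit schedule. Type names such as OrderOnlyLog are ours, for illustration, not from the papers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Recording (OrderOnly): the arbiter appends the ID of each processor
// whose chunk it commits. This list is the entire log.
struct OrderOnlyLog {
  std::vector<std::uint8_t> commits;            // one entry per chunk commit
  void onCommit(std::uint8_t procId) { commits.push_back(procId); }
};

// Replay (OrderOnly): the arbiter grants commit only to the processor
// named next in the log, reproducing the recorded interleaving.
struct OrderOnlyReplayer {
  const OrderOnlyLog& log;
  std::size_t next = 0;
  bool mayCommit(std::uint8_t procId) {
    if (next < log.commits.size() && log.commits[next] == procId) {
      ++next;
      return true;
    }
    return false;                               // processor must wait
  }
};

// PicoLog: commit order is a fixed function of the commit count, so no
// log is needed -- at the cost of stalling processors during recording.
struct PicoLogArbiter {
  std::uint8_t numProcs;
  std::uint64_t count = 0;
  bool mayCommit(std::uint8_t procId) {
    if (procId == count % numProcs) { ++count; return true; }  // round robin
    return false;
  }
};
```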
Data-race detection at production-run speed. The Bulk Multicore can support an efficient data-race detector based on the "happens-before" method10 if it cuts the chunks at synchronization points, rather than at arbitrary dynamic points. Synchronization points are easily recognized by hardware or software, since synchronization operations are executed by special instructions.


This approach is described in ReEnact16; Figure 4 includes examples with a lock, flag, and barrier.

Each chunk is given a counter value called a ChunkID following the happens-before ordering. Specifically, chunks in a given thread receive ChunkIDs that increase in program order. Moreover, a synchronization between two threads orders the ChunkIDs of the chunks involved in the synchronization. For example, in Figure 4a, the chunk in Thread 2 following the lock acquire (Chunk 5) sets its ChunkID to be a successor of both the previous chunk in Thread 2 (Chunk 4) and the chunk in Thread 1 that released the lock (Chunk 2). For the other synchronization primitives, the algorithm is similar. For example, for the barrier in Figure 4c, each chunk immediately following the barrier is given a ChunkID that makes it a successor of all the chunks leading to the barrier.

Using ChunkIDs, we've given a partial ordering to the chunks. For example, in Figure 4a, Chunks 1 and 6 are ordered, but Chunks 3 and 4 are not. Such ordering helps detect data races that occur in a particular execution. Specifically, when two chunks from different threads are found to have a data dependence at runtime, their two ChunkIDs are compared. If the ChunkIDs are ordered, this is not a data race because there is an intervening synchronization between the chunks. Otherwise, a data race has been found.

A simple way to determine when two chunks have a data dependence is to use the Bulk Multicore signatures to tell when the data footprints of two chunks overlap. This operation, together with the comparison and maintenance of ChunkIDs, can be done with low overhead with hardware support. Consequently, the Bulk Multicore can detect data races without significantly slowing the program, making it ideal for debugging production runs.
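The article leaves the exact ChunkID encoding to the implementation; one standard way to realize the partial order it describes is a vector clock per thread, as in this hypothetical sketch. Two chunks whose footprints overlap race exactly when neither ChunkID precedes the other.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A vector-clock ChunkID: one counter per thread. Both ChunkIDs compared
// are assumed to have the same length (one slot per thread).
using ChunkId = std::vector<std::uint64_t>;

// True if chunk a is ordered before chunk b by happens-before.
bool happensBefore(const ChunkId& a, const ChunkId& b) {
  bool strictlyLess = false;
  for (std::size_t t = 0; t < a.size(); ++t) {
    if (a[t] > b[t]) return false;
    if (a[t] < b[t]) strictlyLess = true;
  }
  return strictlyLess;
}

// A runtime dependence between two chunks is a data race only if the
// chunks are unordered (no intervening synchronization).
bool isDataRace(const ChunkId& a, const ChunkId& b) {
  return !happensBefore(a, b) && !happensBefore(b, a);
}

// At a lock acquire, the new chunk's ChunkID succeeds both the previous
// chunk in this thread and the releaser's chunk (as in Figure 4a).
ChunkId successor(const ChunkId& prevLocal, const ChunkId& releaser,
                  std::size_t myThread) {
  ChunkId next(prevLocal.size());
  for (std::size_t t = 0; t < next.size(); ++t)
    next[t] = std::max(prevLocal[t], releaser[t]);
  ++next[myThread];
  return next;
}
```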
Enhancing programmability by making signatures visible to software. Finally, a technique that improves programmability further is to make additional signatures visible to the software. This support enables inexpensive monitoring of memory accesses, as well as novel compiler optimizations that require dynamic disambiguation of sets of addresses (see the sidebar "Making Signatures Visible to Software").

Sidebar: Making Signatures Visible to Software
We propose that the software interact with some additional signatures through three main primitives:18

The first is to explicitly encode into a signature either one address (Figure 1a) or all addresses accessed in a code region (Figure 1b). The latter is enabled by the bcollect (begin collect) and ecollect (end collect) instructions, which can be set to collect only reads, only writes, or both.

The second primitive is to disambiguate the addresses accessed by the processor in a code region against a given signature. It is enabled by the bdisamb.loc (begin disambiguate local) and edisamb.loc (end disambiguate local) instructions (Figure 1c), and can disambiguate reads, writes, or both.

The third primitive is to disambiguate the addresses of incoming coherence messages (invalidations or downgrades) against a given local signature. It is enabled by the bdisamb.rem (begin disambiguate remote) and edisamb.rem (end disambiguate remote) instructions (Figure 1d) and can disambiguate reads, writes, or both. When disambiguation finds a match, the system can deliver an interrupt or set a bit.

Figure 2 includes three examples of what can be done with these primitives. Figure 2a shows how the machine inexpensively supports many watchpoints. The processor encodes into signature Sig2 the address of variable y and all the addresses accessed in function foo(). It then watches all these addresses by executing bdisamb.loc on Sig2.

Figure 2b shows how a second call to a function that reads and writes memory in its body can be skipped. In the figure, the code calls function foo() twice with the same input value of x. To see if the second call can be skipped, the program first collects all addresses accessed by foo() in Sig2. It then disambiguates all subsequent accesses against Sig2. When execution reaches the second call to foo(), it can skip the call if two conditions hold: the first is that the disambiguation did not find a conflict; the second (not shown in the figure) is that the read and write footprints of the first foo() call do not overlap. This possible overlap is checked by separately collecting the addresses read in foo() and those written in foo() in separate signatures and intersecting the resulting signatures.

Finally, Figure 2c shows a way to detect data dependences between threads running on different processors. In the figure, collect encodes all addresses accessed in a code section into Sig2. Surrounding the collect instructions, the code places disamb.rem instructions to monitor if any remotely initiated coherence action conflicts with addresses accessed locally. To disregard read-read conflicts, the programmer can collect the reads in a separate signature and perform remote disambiguation of only writes against that signature.

[Figure 1. Primitives enabling software to interact with additional signatures: collection (a and b), local disambiguation (c), and remote disambiguation (d).]

[Figure 2. Using signatures to support data watchpoints (a), skip execution of functions (b), and detect data dependences between threads running on different processors (c).]
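To suggest how these primitives might be used, here is a software model of the watchpoint idiom of sidebar Figure 2a. The collect() call and the LocalDisambiguator type stand in for the bcollect/ecollect and bdisamb.loc/edisamb.loc instruction pairs; they are illustrative stand-ins, not a real intrinsic API, and the addresses are made up.

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

// Exact set standing in for a conservative hardware signature.
struct SoftSig {
  std::set<std::uint64_t> addrs;
  void collect(std::uint64_t a) { addrs.insert(a); }  // models bcollect..ecollect
  bool mayContain(std::uint64_t a) const { return addrs.count(a) != 0; }
};

// Models the region between bdisamb.loc and edisamb.loc: every access in
// the region is checked against the signature. In hardware a match would
// raise an interrupt or set a bit; here it just reports the hit.
struct LocalDisambiguator {
  const SoftSig& sig;
  void access(std::uint64_t a) const {                // models one load/store
    if (sig.mayContain(a))
      std::printf("watchpoint hit at %llx\n",
                  static_cast<unsigned long long>(a));
  }
};

int main() {
  SoftSig sig2;
  sig2.collect(0x1000);  // address of variable y (illustrative)
  sig2.collect(0x2000);  // an address accessed inside foo() (illustrative)

  LocalDisambiguator watch{sig2};
  watch.access(0x3000);  // unrelated access: no hit
  watch.access(0x1000);  // touches y: watchpoint fires
}
```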

[Figure 3. Parallel execution in the Bulk Multicore (a), with a possible OrderOnly execution log (b) and PicoLog execution log (c).]

[Figure 4. Forming chunks for data-race detection in the presence of a lock (a), flag (b), and barrier (c).]

Reduced Implementation Complexity
The Bulk Multicore also has advantages in performance and in hardware simplicity. It delivers high performance because the processor hardware can reorder and overlap all memory accesses within a chunk—except, of course, those that participate in single-thread dependences. In particular, in the Bulk Multicore, synchronization instructions do not constrain memory access reordering or overlap. Indeed, fences inside a chunk are transformed into null instructions. Fences' traditional functionality of delaying execution until certain references are performed is useless; by construction, no other processor observes the actual order of instruction execution within a chunk.

Moreover, a processor can concurrently execute multiple chunks from the same thread, and memory accesses from these chunks can also overlap. Each concurrently executing chunk in the processor has its own R and W signatures, and individual accesses update the corresponding chunk's signatures. As long as chunks within a processor commit in program order (if a chunk is squashed, its successors are also squashed), correctness is guaranteed. Such concurrent chunk execution in a processor hides the chunk-commit overhead.

Bulk Multicore performance increases further if the compiler generates the chunks, as in the BulkCompiler.1 In this case, the compiler can aggressively optimize the code within each chunk, recognizing that no other processor sees intermediate states within a chunk.

Finally, the Bulk Multicore needs simpler processor hardware than current machines. As discussed earlier, much of the responsibility for memory-consistency enforcement is taken away from critical structures in the core (such as the load queue and L1 cache) and moved to the cache hierarchy, where signatures detect violations of SC.2 For example, this property could enable a new environment in which cores and accelerators are designed without concern for how to satisfy a particular set of access-ordering constraints. This ability allows hardware designers to focus on the novel aspects of their design, rather than on the interaction with the target machine's legacy memory-consistency model. It also motivates the development of commodity accelerators.

Related Work
Numerous proposals for multiprocessor architecture designs focus on improving programmability. In particular, architectures for thread-level speculation (TLS)17 and transactional memory (TM)6 have received significant attention over the past 15 years. These techniques share key primitive mechanisms with the Bulk Multicore, notably speculative state buffering and undo and detection of cross-thread conflicts. However, they also have a different goal, namely to simplify code parallelization, by parallelizing the code transparently to the user software in TLS or by annotating the user code with constructs for mutual exclusion in TM. The Bulk Multicore, on the other hand, aims to provide a broadly usable architectural platform that is easier to program for while delivering advantages in performance and hardware simplicity.


Two architecture proposals involve processors continuously executing blocks of instructions atomically and in isolation. One of them, called Transactional Memory Coherence and Consistency (TCC),5 is a TM environment with transactions occurring all the time. TCC mainly differs from the Bulk Multicore in that its transactions are statically specified in the code, while chunks are created dynamically by the hardware. The second proposal, called Implicit Transactions,19 is a multiprocessor environment with checkpointed processors that regularly take checkpoints. The instructions executed between checkpoints constitute the equivalent of a chunk. No detailed implementation of the scheme is presented.

Automatic Mutual Exclusion (AME)7 is a programming model in which a program is written as a group of atomic fragments that serialize in some manner. As in TCC, atomic sections in AME are statically specified in the code, while the Bulk Multicore chunks are hardware-generated dynamic entities.

The signature hardware we've introduced here has been adapted for use in TM (such as in transaction-footprint collection and in address disambiguation12,21).

Several proposals implement data-race detection, deterministic replay of multiprocessor programs, and other debugging techniques discussed here without operating in chunks.4,11,15,20 Comparing their operation to chunk operation is the subject of future work.

Future Directions
The Bulk Multicore architecture is a novel approach to building shared-memory multiprocessors, where the whole execution operates in atomic chunks of instructions. This approach can enable significant improvements in the productivity of parallel programmers while imposing no restriction on the programming model or language used.

At the architecture level, we are examining the scalability of this organization. While chunk commit requires arbitration in a (potentially distributed) arbiter, the operation in chunks is inherently latency tolerant. At the programming level, we are examining how chunk operation enables efficient support for new program-development and debugging tools, aggressive autotuners and compilers, and even novel programming models.

Acknowledgments
We would like to thank the many present and past members of the I-acoma group at the University of Illinois who contributed through many discussions, seminars, and brainstorming sessions. This work is supported by the U.S. National Science Foundation, Defense Advanced Research Projects Agency, and Department of Energy and by Intel and Microsoft under the Universal Parallel Computing Research Center, Sun Microsystems under the University of Illinois OpenSPARC Center of Excellence, and IBM.

References
1. Ahn, W., Qi, S., Lee, J.W., Nicolaides, M., Fang, X., Torrellas, J., Wong, D., and Midkiff, S. BulkCompiler: High-performance sequential consistency through cooperative compiler and hardware support. In Proceedings of the International Symposium on Microarchitecture (New York City, Dec. 12–16). IEEE Press, 2009.
2. Ceze, L., Tuck, J., Montesinos, P., and Torrellas, J. BulkSC: Bulk enforcement of sequential consistency. In Proceedings of the International Symposium on Computer Architecture (San Diego, CA, June 9–13). ACM Press, New York, 2007, 278–289.
3. Ceze, L., Tuck, J., Cascaval, C., and Torrellas, J. Bulk disambiguation of speculative threads in multiprocessors. In Proceedings of the International Symposium on Computer Architecture (Boston, MA, June 17–21). IEEE Press, 2006, 227–238.
4. Choi, J., Lee, K., Loginov, A., O'Callahan, R., Sarkar, V., and Sridharan, M. Efficient and precise data-race detection for multithreaded object-oriented programs. In Proceedings of the Conference on Programming Language Design and Implementation (Berlin, Germany, June 17–19). ACM Press, New York, 2002, 258–269.
5. Hammond, L., Wong, V., Chen, M., Carlstrom, B.D., Davis, J.D., Hertzberg, B., Prabhu, M.K., Wijaya, H., Kozyrakis, C., and Olukotun, K. Transactional memory coherence and consistency. In Proceedings of the International Symposium on Computer Architecture (München, Germany, June 19–23). IEEE Press, 2004, 102–113.
6. Herlihy, M. and Moss, J.E.B. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the International Symposium on Computer Architecture (San Diego, CA, May 16–19). IEEE Press, 1993, 289–300.
7. Isard, M. and Birrell, A. Automatic mutual exclusion. In Proceedings of the Workshop on Hot Topics in Operating Systems (San Diego, CA, May 7–9). USENIX, 2007.
8. Kuck, D. Facing up to software's greatest challenge: Practical parallel processing. Computers in Physics 11, 3 (1997).
9. Lamport, L. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers C-28, 9 (Sept. 1979), 690–691.
10. Lamport, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558–565.
11. Lu, S., Tucek, J., Qin, F., and Zhou, Y. AVIO: Detecting atomicity violations via access interleaving invariants. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, CA, Oct. 21–25). ACM Press, New York, 2006, 37–48.
12. Minh, C., Trautmann, M., Chung, J., McDonald, A., Bronson, N., Casper, J., Kozyrakis, C., and Olukotun, K. An effective hybrid transactional memory with strong isolation guarantees. In Proceedings of the International Symposium on Computer Architecture (San Diego, CA, June 9–13). ACM Press, New York, 2007, 69–80.
13. Montesinos, P., Ceze, L., and Torrellas, J. DeLorean: Recording and deterministically replaying shared-memory multiprocessor execution efficiently. In Proceedings of the International Symposium on Computer Architecture (Beijing, June 21–25). IEEE Press, 2008, 289–300.
14. Musuvathi, M. and Qadeer, S. Iterative context bounding for systematic testing of multithreaded programs. In Proceedings of the Conference on Programming Language Design and Implementation (San Diego, CA, June 10–13). ACM Press, New York, 2007, 446–455.
15. Narayanasamy, S., Pereira, C., and Calder, B. Recording shared memory dependencies using strata. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, CA, Oct. 21–25). ACM Press, New York, 2006, 229–240.
16. Prvulovic, M. and Torrellas, J. ReEnact: Using thread-level speculation mechanisms to debug data races in multithreaded codes. In Proceedings of the International Symposium on Computer Architecture (San Diego, CA, June 9–11). IEEE Press, 2003, 110–121.
17. Sohi, G., Breach, S., and Vijaykumar, T. Multiscalar processors. In Proceedings of the International Symposium on Computer Architecture (Santa Margherita Ligure, Italy, June 22–24). ACM Press, New York, 1995, 414–425.
18. Tuck, J., Ahn, W., Ceze, L., and Torrellas, J. SoftSig: Software-exposed hardware signatures for code analysis and optimization. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (Seattle, WA, Mar. 1–5). ACM Press, New York, 2008, 145–156.
19. Vallejo, E., Galluzzi, M., Cristal, A., Vallejo, F., Beivide, R., Stenstrom, P., Smith, J.E., and Valero, M. Implementing kilo-instruction multiprocessors. In Proceedings of the International Conference on Pervasive Services (Santorini, Greece, July 11–14). IEEE Press, 2005, 325–336.
20. Xu, M., Bodik, R., and Hill, M.D. A 'flight data recorder' for enabling full-system multiprocessor deterministic replay. In Proceedings of the International Symposium on Computer Architecture (San Diego, CA, June 9–11). IEEE Press, 2003, 122–133.
21. Yen, L., Bobba, J., Marty, M., Moore, K., Volos, H., Hill, M., Swift, M., and Wood, D. LogTM-SE: Decoupling hardware transactional memory from caches. In Proceedings of the International Symposium on High Performance Computer Architecture (Phoenix, AZ, Feb. 10–14). IEEE Press, 2007, 261–272.

Josep Torrellas (torrellas@cs.uiuc.edu) is a professor and Willett Faculty Scholar in the Department of Computer Science at the University of Illinois at Urbana-Champaign.

Luis Ceze (luisceze@cs.washington.edu) is an assistant professor in the Department of Computer Science and Engineering at the University of Washington, Seattle, WA.

James Tuck (jtuck@ncsu.edu) is an assistant professor in the Department of Electrical and Computer Engineering at North Carolina State University, Raleigh, NC.

Calin Cascaval (cascaval@us.ibm.com) is a research staff member and manager of programming models and tools for scalable systems at the IBM T.J. Watson Research Center, Yorktown Heights, NY.

Pablo Montesinos (pmontesi@samsung.com) is a staff engineer in the Multicore Research Group at Samsung Information Systems America, San Jose, CA.

Wonsun Ahn (dahn2@uiuc.edu) is a graduate student in the Department of Computer Science at the University of Illinois at Urbana-Champaign.

Milos Prvulovic (milos@cc.gatech.edu) is an associate professor in the School of Computer Science, College of Computing, Georgia Institute of Technology, Atlanta, GA.

© 2009 ACM 0001-0782/09/1200 $10.00
