All content following this page was uploaded by Josep Torrellas on 28 February 2015.
Architecture for Improved Programmability

Multicore chips as commodity architecture for platforms ranging from handhelds to supercomputers herald an era when parallel programming and computing will be the norm. While the computer science and engineering community has periodically focused on advancing the technology for parallel processing,8 this time around the stakes are truly high, since there is no obvious route to higher performance other than through parallelism. However, for parallel computing to become widespread, breakthroughs are needed in all layers of the computing stack, including languages, programming models, compilation and runtime software, programming and debugging tools, and hardware architectures.

At the hardware-architecture layer, we need to change the way multicore architectures are designed.

[...] minimize the chance of (parallel) programming errors.

In this article, we describe a novel, general-purpose multicore architecture, the Bulk Multicore, which we designed to enable a highly programmable environment. In it, the programmer and runtime system are relieved of having to manage the sharing of data thanks to novel support for scalable hardware cache coherence. Moreover, to help minimize the chance of parallel-programming errors, the Bulk Multicore provides to the software high-performance sequential memory consistency and also introduces several novel hardware primitives. These primitives can be used to build a sophisticated program-development-and-debugging environment, including low-overhead data-race detection, deterministic replay of parallel programs, and high-speed disambiguation of sets of addresses. The primitives have an overhead low enough to always be “on” during production runs.

The key idea in the Bulk Multicore is twofold: First, the hardware automatically executes all software as a series of atomic blocks of thousands of dynamic instructions called Chunks. Chunk execution is invisible to the software and, therefore, puts no restriction on the programming language or model. Second, the Bulk Multicore introduces the use of Hardware Address Signatures as a low-overhead mechanism to ensure atomic and isolated execution of chunks and help
ticore design where the compiler observes the chunks, the compiler can further improve performance by heavily optimizing the instructions within each chunk. Finally, the Bulk Multicore organization decreases hardware

tion. Having to provide such state in a multiprocessor environment (even if no other processor or unit in the machine needs it) contributes to the complexity of current system designs. This is because, in such an environ-

hardware-only mechanism, invisible to the software running on the processor. Moreover, its purpose is not to parallelize a thread, since the chunks in a thread are not distributed to other processors. Rather, the purpose is to
DECEMBER 2009 | VOL. 52 | NO. 12 | COMMUNICATIONS OF THE ACM 59
contributed articles
improve programmability and performance.

Each chunk executes on the processor atomically and in isolation. Atomic execution means that none of the chunk’s actions are made visible to the rest of the system (processors or main memory) until the chunk completes and commits. Execution in isolation means that if the chunk reads a location and (before it commits) a second chunk in another processor that has written to the location commits, then the local chunk is squashed and must re-execute.

To execute chunks atomically and in isolation inexpensively, the Bulk Multicore introduces hardware address signatures.3 A signature is a register of ≈1,024 bits that accumulates hash-encoded addresses. Figure 1 outlines a simple way to generate a signature (see the sidebar “Signatures and Signature Operations in Hardware” for a deeper discussion). A signature, therefore, represents a set of addresses.

In the Bulk Multicore, the hardware automatically accumulates the addresses read and written by a chunk into a read (R) and a write (W) signature, respectively. These signatures are kept in a module in the cache hierarchy. This module also includes simple functional units that operate on signatures, performing such operations as signature intersection (to find the addresses common to two signatures) and address membership test (to find out whether an address belongs to a signature), as detailed in
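
To make the signature operations above concrete, here is a minimal software sketch of a signature with insertion, membership test, and intersection, plus the isolation check applied when another chunk commits. It is an illustration under stated assumptions, not the hardware design from the article: the hash in `bit_index`, the 64-byte line size, and the names `Signature`, `Chunk`, and `must_squash` are all hypothetical. The real hardware hashing is described in the article's Figure 1 and sidebar, which are not reproduced here.

```python
SIG_BITS = 1024   # the article describes a register of about 1,024 bits
LINE_SHIFT = 6    # assumption: 64-byte cache lines


def bit_index(addr: int) -> int:
    """Hypothetical hash: cache-line number modulo the signature width."""
    return (addr >> LINE_SHIFT) % SIG_BITS


class Signature:
    """Fixed-width bit vector encoding a superset of a set of addresses."""

    def __init__(self, bits: int = 0) -> None:
        self.bits = bits

    def insert(self, addr: int) -> None:
        """Accumulate one hash-encoded address."""
        self.bits |= 1 << bit_index(addr)

    def may_contain(self, addr: int) -> bool:
        """Membership test: True may be a false positive, False is exact."""
        return bool((self.bits >> bit_index(addr)) & 1)

    def intersect(self, other: "Signature") -> "Signature":
        """Signature intersection: bitwise AND of the two registers."""
        return Signature(self.bits & other.bits)

    def is_empty(self) -> bool:
        return self.bits == 0


class Chunk:
    """A running chunk with its read (R) and write (W) signatures."""

    def __init__(self) -> None:
        self.R = Signature()
        self.W = Signature()

    def read(self, addr: int) -> None:
        self.R.insert(addr)

    def write(self, addr: int) -> None:
        self.W.insert(addr)


def must_squash(local: Chunk, committing: Chunk) -> bool:
    """Isolation check: squash the local chunk if the committing chunk
    wrote a location that the local chunk has (possibly) read."""
    return not committing.W.intersect(local.R).is_empty()
```

Because several addresses can hash to the same bit, a signature encodes a superset of the inserted addresses. In this sketch a false positive can only cause an unnecessary squash, never a missed conflict: `must_squash` returns `True` whenever the committing chunk wrote a line the local chunk read, and may also return `True` for two unrelated lines that alias to the same bit.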
novel compiler optimizations that require dynamic disambiguation of sets of addresses (see the sidebar “Making Signatures Visible to Software”).

Figure 3. Parallel execution in the Bulk Multicore (a), with a possible OrderOnly execution log (b) and PicoLog execution log (c).

Reduced Implementation Complexity
The Bulk Multicore also has advan-
are statically specified in the code, while chunks are created dynamically by the hardware. The second proposal, called Implicit Transactions,19 is a multiprocessor environment with checkpointed processors that regularly take checkpoints. The instructions executed between checkpoints constitute the equivalent of a chunk. No detailed implementation of the scheme is presented.

Automatic Mutual Exclusion (AME)7 is a programming model in which a program is written as a group of atomic fragments that serialize in some manner. As in TCC, atomic sections in AME are statically specified in the code, while the Bulk Multicore chunks are hardware-generated dynamic entities.

The signature hardware we’ve introduced here has been adapted for use in TM (such as in transaction-footprint collection and in address disambiguation12,21).

Several proposals implement data-race detection, deterministic replay of multiprocessor programs, and other debugging techniques discussed here without operating in chunks.4,11,15,20 Comparing their operation to chunk operation is the subject of future work.

Future Directions

The Bulk Multicore architecture is a novel approach to building shared-memory multiprocessors, where the whole execution operates in atomic chunks of instructions. This approach can enable significant improvements in the productivity of parallel programmers while imposing no restriction on the programming model or language used.

At the architecture level, we are examining the scalability of this organization. While chunk commit requires arbitration in a (potentially distributed) arbiter, the operation in chunks is inherently latency tolerant. At the programming level, we are examining how chunk operation enables efficient support for new program-development and debugging tools, aggressive autotuners and compilers, and even novel programming models.

Acknowledgments

We would like to thank the many present and past members of the I-acoma group at the University of Illinois who contributed through many discussions, seminars, and brainstorming sessions. This work is supported by the U.S. National Science Foundation, Defense Advanced Research Projects Agency, and Department of Energy and by Intel and Microsoft under the Universal Parallel Computing Research Center, Sun Microsystems under the University of Illinois OpenSPARC Center of Excellence, and IBM.

References

1. Ahn, W., Qi, S., Lee, J.W., Nicolaides, M., Fang, X., Torrellas, J., Wong, D., and Midkiff, S. BulkCompiler: High-performance sequential consistency through cooperative compiler and hardware support. In Proceedings of the International Symposium on Microarchitecture (New York City, Dec. 12–16). IEEE Press, 2009.
2. Ceze, L., Tuck, J., Montesinos, P., and Torrellas, J. BulkSC: Bulk enforcement of sequential consistency. In Proceedings of the International Symposium on Computer Architecture (San Diego, CA, June 9–13). ACM Press, New York, 2007, 278–289.
3. Ceze, L., Tuck, J., Cascaval, C., and Torrellas, J. Bulk disambiguation of speculative threads in multiprocessors. In Proceedings of the International Symposium on Computer Architecture (Boston, MA, June 17–21). IEEE Press, 2006, 227–238.
4. Choi, J., Lee, K., Loginov, A., O’Callahan, R., Sarkar, V., and Sridharan, M. Efficient and precise data-race detection for multithreaded object-oriented programs. In Proceedings of the Conference on Programming Language Design and Implementation (Berlin, Germany, June 17–19). ACM Press, New York, 2002, 258–269.
5. Hammond, L., Wong, V., Chen, M., Carlstrom, B.D., Davis, J.D., Hertzberg, B., Prabhu, M.K., Wijaya, H., Kozyrakis, C., and Olukotun, K. Transactional memory coherence and consistency. In Proceedings of the International Symposium on Computer Architecture (München, Germany, June 19–23). IEEE Press, 2004, 102–113.
6. Herlihy, M. and Moss, J.E.B. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the International Symposium on Computer Architecture (San Diego, CA, May 16–19). IEEE Press, 1993, 289–300.
7. Isard, M. and Birrell, A. Automatic mutual exclusion. In Proceedings of the Workshop on Hot Topics in Operating Systems (San Diego, CA, May 7–9). USENIX, 2007.
8. Kuck, D. Facing up to software’s greatest challenge: Practical parallel processing. Computers in Physics 11, 3 (1997).
9. Lamport, L. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers C-28, 9 (Sept. 1979), 690–691.
10. Lamport, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558–565.
11. Lu, S., Tucek, J., Qin, F., and Zhou, Y. AVIO: Detecting atomicity violations via access interleaving invariants. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, CA, Oct. 21–25). ACM Press, New York, 2006, 37–48.
12. Minh, C., Trautmann, M., Chung, J., McDonald, A., Bronson, N., Casper, J., Kozyrakis, C., and Olukotun, K. An effective hybrid transactional memory with strong isolation guarantees. In Proceedings of the International Symposium on Computer Architecture (San Diego, CA, June 9–13). ACM Press, New York, 2007, 69–80.
13. Montesinos, P., Ceze, L., and Torrellas, J. DeLorean: Recording and deterministically replaying shared-memory multiprocessor execution efficiently. In Proceedings of the International Symposium on Computer Architecture (Beijing, June 21–25). IEEE Press, 2008, 289–300.
14. Musuvathi, M. and Qadeer, S. Iterative context bounding for systematic testing of multithreaded programs. In Proceedings of the Conference on Programming Language Design and Implementation (San Diego, CA, June 10–13). ACM Press, New York, 2007, 446–455.
15. Narayanasamy, S., Pereira, C., and Calder, B. Recording shared memory dependencies using strata. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, CA, Oct. 21–25). ACM Press, New York, 2006, 229–240.
16. Prvulovic, M. and Torrellas, J. ReEnact: Using thread-level speculation mechanisms to debug data races in multithreaded codes. In Proceedings of the International Symposium on Computer Architecture (San Diego, CA, June 9–11). IEEE Press, 2003, 110–121.
17. Sohi, G., Breach, S., and Vijayakumar, T. Multiscalar processors. In Proceedings of the International Symposium on Computer Architecture (Santa Margherita Ligure, Italy, June 22–24). ACM Press, New York, 1995, 414–425.
18. Tuck, J., Ahn, W., Ceze, L., and Torrellas, J. SoftSig: Software-exposed hardware signatures for code analysis and optimization. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (Seattle, WA, Mar. 1–5). ACM Press, New York, 2008, 145–156.
19. Vallejo, E., Galluzzi, M., Cristal, A., Vallejo, F., Beivide, R., Stenstrom, P., Smith, J.E., and Valero, M. Implementing kilo-instruction multiprocessors. In Proceedings of the International Conference on Pervasive Services (Santorini, Greece, July 11–14). IEEE Press, 2005, 325–336.
20. Xu, M., Bodik, R., and Hill, M.D. A ‘flight data recorder’ for enabling full-system multiprocessor deterministic replay. In Proceedings of the International Symposium on Computer Architecture (San Diego, CA, June 9–11). IEEE Press, 2003, 122–133.
21. Yen, L., Bobba, J., Marty, M., Moore, K., Volos, H., Hill, M., Swift, M., and Wood, D. LogTM-SE: Decoupling hardware transactional memory from caches. In Proceedings of the International Symposium on High Performance Computer Architecture (Phoenix, AZ, Feb. 10–14). IEEE Press, 2007, 261–272.

Josep Torrellas (torrellas@cs.uiuc.edu) is a professor and Willett Faculty Scholar in the Department of Computer Science at the University of Illinois at Urbana-Champaign.

Luis Ceze (luisceze@cs.washington.edu) is an assistant professor in the Department of Computer Science and Engineering at the University of Washington, Seattle, WA.

James Tuck (jtuck@ncsu.edu) is an assistant professor in the Department of Electrical and Computer Engineering at North Carolina State University, Raleigh, NC.

Calin Cascaval (cascaval@us.ibm.com) is a research staff member and manager of programming models and tools for scalable systems at the IBM T.J. Watson Research Center, Yorktown Heights, NY.

Pablo Montesinos (pmontesi@samsung.com) is a staff engineer in the Multicore Research Group at Samsung Information Systems America, San Jose, CA.

Wonsun Ahn (dahn2@uiuc.edu) is a graduate student in the Department of Computer Science at the University of Illinois at Urbana-Champaign.

Milos Prvulovic (milos@cc.gatech.edu) is an associate professor in the School of Computer Science, College of Computing, Georgia Institute of Technology, Atlanta, GA.

© 2009 ACM 0001-0782/09/1200 $10.00