
Front cover

Blue Gene/L: Performance Analysis Tools


Learn about Blue Gene/L performance tooling

Discover the details about PAPI and the External Performance Monitor

Understand the pros and cons of the different tools

Gary Mullen-Schultz

ibm.com/redbooks

International Technical Support Organization

Blue Gene/L: Performance Analysis Tools

July 2006

SG24-7278-00

Note: Before using this information and the product it supports, read the information in Notices on page v.

First Edition (July 2006) This edition applies to Version 1, Release 3, Modification 1 of Blue Gene/L (product number 5733-BG1).

Copyright International Business Machines Corporation 2006. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents
Notices
  Trademarks

Preface
  The team that wrote this redbook
  Become a published author
  Comments welcome

Chapter 1. Performance guidelines and tools
  1.1 Tooling overview
    1.1.1 IBM High Performance Computing Toolkit
  1.2 General performance testing
    1.2.1 Overview of the tools that are available on System p
    1.2.2 Overview of tools ported to Blue Gene/L
  1.3 Message passing performance
    1.3.1 MPI Tracer and Profiler
  1.4 CPU performance
    1.4.1 Hardware performance monitor
    1.4.2 Xprofiler
  1.5 I/O performance
    1.5.1 Modular I/O
  1.6 Visualization and analysis
    1.6.1 PeekPerf
  1.7 MASS and MASSV libraries

Chapter 2. Comparison of performance tools
  2.1 External Performance Instrumentation Facility
  2.2 Performance Application Programming Interface
  2.3 Summary comparison of Perfmon and PAPI

Chapter 3. External Performance Instrumentation Facility
  3.1 Overview of EPIF and Perfmon
    3.1.1 Objectives
    3.1.2 Goals and strategies
  3.2 Basic concepts
  3.3 EPIF commands
    3.3.1 perfmon
    3.3.2 dsp_perfmon
    3.3.3 ext_perfmon_data
    3.3.4 exp_perfmon_data
    3.3.5 imp_perfmon_data
    3.3.6 end_perfmon
  3.4 Building the necessary Python packages
  3.5 Typical command uses
    3.5.1 Options for EPIF
    3.5.2 Options for ext_perfmon_data
    3.5.3 dsp_perfmon
    3.5.4 exp_perfmon_data
    3.5.5 imp_perfmon_data

Chapter 4. Performance Application Programming Interface
  4.1 PAPI implementation
    4.1.1 The linux-bgl PAPI substrate
    4.1.2 PAPI event mapping for Blue Gene/L
    4.1.3 Modifications to PAPI
  4.2 Examples of using hardware performance monitor libraries for Blue Gene/L
    4.2.1 PAPI library usage examples
    4.2.2 bgl_perfctr usage example
  4.3 Conclusion

Appendix A. Statement of completion
Appendix B. Electromagnetic compatibility
Appendix C. Perfmon database table specifications
  Database organization
  Performance collection instance table: BGLPERFINST
  Performance definition table: BGLPERFDEF
  Performance description table: BGLPERFDESC
  Performance job table: BGLPERFJOB
  Performance location table: BGLPERFLOCATION
  Performance samples definition table: BGLPERFSAMPLES
  Performance data file table: BGLPERFDATA
  BGLPERFDESC table
  BGLPERFDEF and BGLPERFDESC table join
Appendix D. gmon support on Blue Gene/L
  How to enable gmon profiling
  Additional function in Blue Gene/L gmon support
    Multiple gmon.out.x files
    Enabling or disabling profiling within your application
    Collecting gmon data as a set of program counter values instead of as a histogram
  Enhancements to gprof in the Blue Gene/L toolchain
    Using gprof to read gmon.sample.x files
    Using gprof to merge a very large number of gmon.out.x files

Glossary

Related publications
  IBM Redbooks
  Other publications
  Online resources
  How to get IBM Redbooks
  Help from IBM

Index


Notices
This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. 
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrates programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. 
You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces.


Trademarks
The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
AIX 5L, AIX, Blue Gene, IBM, LoadLeveler, PowerPC, POWER, Redbooks, Redbooks (logo), System i, System p, Tracer

The following terms are trademarks of other companies: Java, Sun, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Excel, Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.


Preface
This IBM Redbook is one in a series of IBM publications written specifically for the IBM System Blue Gene supercomputer, Blue Gene/L, which was developed by IBM in collaboration with Lawrence Livermore National Laboratory (LLNL). This redbook provides an overview of the application development performance analysis environment for Blue Gene/L. It explains some of the tools that are available for application-level performance analysis and devotes the majority of its content to Chapter 3, "External Performance Instrumentation Facility", and Chapter 4, "Performance Application Programming Interface".

The team that wrote this redbook


This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization (ITSO), Poughkeepsie Center. Gary L. Mullen-Schultz is a Consulting IT Specialist at the ITSO, Poughkeepsie Center. He leads the team that is responsible for producing Blue Gene/L documentation, and is the primary author of this redbook. Gary also focuses on Java and WebSphere. He is a Sun Certified Java Programmer, Developer and Architect, and has three issued patents. Thanks to the following people for their contributions to this project: Mark Mendell Kara Moscoe IBM Toronto, Canada Ed Barnard Todd Kelsey Gary Lakner James Milano Jenifer Servais Janet Willis ITSO, Poughkeepsie Center Charles Archer Peter Bergner Lynn Boger Mike Brutman Jay Bryant Tom Budnik Kathy Cebell Jeff Chauvin Roxanne Clarke Darwin Dumonceaux David Hermsmeier Mike Hjalmervik Frank Ingram Kerry Kaliszewski Brant Knudson Glenn Leckband

Matt Light Dave Limpert Chris Marroquin Randall Massot Curt Mathiowetz Pat McCarthy Mark Megerian Marv Misgen Jose Moreira Mike Mundy Mike Nelson Jeff Parker Kurt Pinnow Scott Plaetzer Ruth Poole Joan Rabe Joseph Ratterman Don Reed Harold Rodakowski Brent Swartz Richard Shok Brian Smith Karl Solie Wayne Wellik Nancy Whetstone Mike Woiwood IBM Rochester Tamar Domany Edi Shmueli IBM Israel Gary Sutherland Ed Varella IBM Poughkeepsie Gheorghe Almasi Bob Walkup IBM T.J. Watson Research Center


Become a published author


Join us for a two- to six-week residency program! Help write an IBM Redbook dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You'll team with IBM technical professionals, Business Partners and/or customers. Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you'll develop a network of contacts in IBM development labs, and increase your productivity and marketability. Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html

Comments welcome
Your comments are important to us! We want our Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways: Use the online Contact us review redbook form found at:
ibm.com/redbooks

Send your comments in an e-mail to:


redbook@us.ibm.com

Mail your comments to: IBM Corporation, International Technical Support Organization Dept. HYTD Mail Station P099 2455 South Road Poughkeepsie, NY 12601-5400


Chapter 1. Performance guidelines and tools


This chapter describes the process of using tools to analyze system and application performance.


1.1 Tooling overview


A variety of tools are available to help you understand your application's performance when running on Blue Gene/L. Some of these tools are written by IBM, others are written by independent software vendors (ISVs), and still others are open source efforts. Some tools have been ported to Blue Gene/L, and more are moving over every month. However, almost all of the tools work on the IBM System p platform, so if a tool you want to use does not yet support Blue Gene/L, it can be advantageous to run your application on System p and profile it there. This section first discusses the main tool suite from IBM. Then it examines the tools that you can use on the System p platform to help understand Blue Gene/L performance. Finally, it looks at tools that run natively with applications on Blue Gene/L.

1.1.1 IBM High Performance Computing Toolkit


The Advanced Computing Technology Center (ACTC), part of IBM Research in Yorktown Heights, New York, conducts research on the performance behavior of scientific and technical computing applications. Its role in IBM is to provide strategic technical direction for the research and development of server platforms to advance the state of the art in high performance computing offerings and solutions for IBM Clients in computationally intensive industries. Such industries include automotive, aerospace, petroleum, meteorology, and life sciences. IBM offers the IBM High Performance Computing Toolkit, a suite of performance-related tools and libraries to assist in application tuning. This toolkit is an integrated environment for performance analysis of sequential and parallel applications using the Message Passing Interface (MPI) and OpenMP paradigms. It provides a common framework for IBM mid-range server offerings, including the IBM System p and System i platforms and Blue Gene/L systems, on both AIX and Linux.

1.2 General performance testing


If possible, IBM recommends that you test an application on a System p machine before you run it on a Blue Gene/L system, using a memory size per Compute Node that is compatible with the Blue Gene/L architecture. This approach makes it possible to check both memory utilization and performance issues. Both the System p platform and the Blue Gene/L supercomputer use IBM XL compilers, which aid portability between the two systems.

1.2.1 Overview of the tools that are available on System p


For the best performance, it is good practice to obtain a performance profile for your application. IBM is porting its comprehensive performance analysis tools, the High Performance Computing Toolkit, to the Blue Gene/L supercomputer. In the meantime, we recommend that you perform profiling on a similar system, such as the System p platform. Most computational performance issues are the same on Blue Gene/L as on other reduced instruction set computer (RISC) processors, so this method usually identifies the main issues. For parallel performance, several MPI profiling tools are available, including the ones listed in the following sections.


IBM High Performance Computing Toolkit


The IBM High Performance Computing Toolkit is the foundation for all performance tools for Blue Gene/L and the IBM System family. The tools provide source code traceback of the performance data to help the user quickly identify any bottlenecks in the code. The toolkit includes low-overhead measurement of time spent in MPI routines for applications written in any mixture of Fortran, C, and C++. The tools include Xprofiler, MPI Tracer, MPI Profiler, and PeekPerf. The toolkit provides a text summary and an optional graphical display.

Paraver
Paraver is a graphical user interface (GUI)-based performance visualization and analysis tool that you can use to analyze parallel programs. It lets you obtain detailed information from raw performance traces. To learn more about Paraver, refer to the following Web address:
http://www.cepba.upc.es/paraver/

MPE/jumpshot
MPICH2 has extensions for profiling MPI applications, and the MPE extensions have been ported to Blue Gene/L. For more information, refer to the following Web address:
http://www-unix.mcs.anl.gov/mpi/mpich/

1.2.2 Overview of tools ported to Blue Gene/L


The following tools have been ported to the Blue Gene/L platform: Kit for Objective Judgement and Knowledge (KOJAK)-based detection of performance bottlenecks
http://www.fz-juelich.de/zam/kojak/

Tuning and Analysis Utilities (TAU)


http://www.cs.uoregon.edu/research/paracomp/tau/tautools/

1.3 Message passing performance


Measuring the performance of message passing (MPI) in an application can quickly help identify trouble areas. MPI Tracer and Profiler consist of a set of libraries that collect profiling and tracing data for MPI programs. Performance metrics, such as the time used by MPI function calls and message sizes, are reported. These tools are available from the IBM ACTC. For more information about these and other tools that this organization provides, go to the following Web address:
http://www.research.ibm.com/actc/

1.3.1 MPI Tracer and Profiler


MPI Tracer and Profiler consist of a set of libraries that collect profiling and tracing data for MPI programs. Performance metrics, such as the time used by MPI function calls and message sizes, are reported. MPI Tracer works with the visualization tools PeekPerf and PeekView to better help users identify performance bottlenecks. PeekPerf maps performance metrics back to the source code. PeekView gives a visual representation of the overall computation and communication pattern of the system. MPI Profiler captures summary data for MPI calls. By this, we mean that it does not show the specifics of an individual call to, for example, MPI_Send, but rather the combined data for all calls made to that routine during the profile period. See Figure 1-1 for an example.

Figure 1-1 MPI Profiler summary data

Important: It is vital that you call MPI_Finalize in your application for the profiling function to correctly gather data. No changes to your source code are required to use the MPI Profiler function. However, you must compile using the debug (-g) flag.

1.4 CPU performance


The CPU performance tools are from the IBM ACTC. For more information about these and other tools that IBM ACTC provides, refer to the following Web address:
http://www.research.ibm.com/actc/

1.4.1 Hardware performance monitor


The hardware performance counter monitor module provides comprehensive reports of events that are critical to performance on IBM systems. In addition to the usual timing information, the hardware performance monitor can gather critical hardware performance metrics. These might include the number of misses on all cache levels, the number of floating-point instructions executed, and the number of instruction loads that cause Translation Lookaside Buffer (TLB) misses, which help the algorithm designer or programmer identify and eliminate performance bottlenecks.

1.4.2 Xprofiler
Xprofiler is among a set of CPU profiling tools, such as gprof, pprof, and tprof, that are provided on AIX. You can use them to profile both serial and parallel applications. Xprofiler uses procedure-profiling information to construct a graphical display of the functions within an application. It provides quick access to the profiled data and helps users identify the functions that are the most CPU intensive. With the GUI, it is easy to find the application's performance-critical areas.

1.5 I/O performance


Understanding input/output (I/O) performance is as important as understanding application and CPU performance issues.

1.5.1 Modular I/O


Modular I/O is not yet officially supported on Blue Gene/L. It addresses the need for application-level I/O optimization. For I/O-intensive applications, the Modular I/O libraries provide a means to analyze the I/O behavior of applications and tune I/O at the application level for optimal performance. For example, when an application exhibits the I/O pattern of sequentially reading large files, Modular I/O detects the behavior and invokes its asynchronous prefetching module to prefetch user data. Tests with the AIX journaled file system (JFS) demonstrate significant improvement in system throughput when using Modular I/O.

1.6 Visualization and analysis


The PeekPerf tool is from the IBM ACTC. For more information about this and other tools that the ACTC provides, refer to the following Web address:
http://www.research.ibm.com/actc/

1.6.1 PeekPerf
PeekPerf visualizes the performance trace information generated by the performance analysis tools. It also maps the collected performance data back to the source code, which makes it easier for users to find bottlenecks and points for optimization. PeekPerf is available on several UNIX derivatives (AIX, Linux) and on Microsoft Windows.


1.7 MASS and MASSV libraries


The Mathematical Acceleration Subsystem (MASS) and MASSV libraries consist of a set of mathematical functions for C, C++, and Fortran-language applications that are tuned for specific IBM POWER architectures. You can learn more about these libraries at the following Web address:
http://www.ibm.com/software/awdtools/mass/support/

Both scalar (libmass.a) and vector (libmassv.a) intrinsic routines are tuned for the Blue Gene/L computer. In many situations, using these libraries has been shown to result in significant code performance improvement. Such routines as sin, cos, exp, log, and so forth from these libraries are significantly faster than the standard routines from GNU libm.a. For example, a sqrt() call costs about 106 cycles with libm.a, about 46 cycles for libmass.a, and 8 to 10 cycles per evaluation for a vector of sqrt() calls in libmassv.a. To link with libmass.a, include the following option on the link line:
-Wl,--allow-multiple-definition.


Chapter 2. Comparison of performance tools


Two primary tools are provided to gather application-level performance data about Blue Gene/L applications: the External Performance Instrumentation Facility (EPIF, also known as Perfmon) and the Performance Application Programming Interface (PAPI). In this chapter, we discuss the functions of each of these tools to help you decide which is best suited to which situations.


2.1 External Performance Instrumentation Facility


Perfmon is an IBM performance tool that is designed specifically for Blue Gene/L. As indicated by the word external in the title External Performance Instrumentation Facility, Perfmon runs separately from the actual application. No changes to source code are required to run Perfmon; it operates completely externally from the application being measured. This affords Perfmon some advantages:

- Because it runs externally, its impact on the running application is significantly smaller.
- System administrators can measure aspects of application performance without requesting changes to (and subsequent recompilation of) source code.
- Viewing performance results is made easier by the viewer and data export utilities. In addition, data from Compute Nodes can be automatically aggregated, taking this responsibility off of the application or the person who is analyzing the results.

Of course, these advantages have some drawbacks:

- Because Perfmon is specific to Blue Gene/L, its functionality is not portable to other supercomputing platforms.
- Perfmon gathers performance data at the application level. It is not possible to get more granular data (for example, specific to a certain block of code).

2.2 Performance Application Programming Interface


PAPI is an industry-standard performance application programming interface (API) that is designed to be supported on a broad range of computing devices. It is backed by a broad range of organizations across academia and industry. The main home page for PAPI is at the following Web address:
http://icl.cs.utk.edu/papi/

Some of the advantages of PAPI include:

- Because it is based on an industry standard, its functionality is portable to other supercomputing platforms.
- PAPI allows performance data to be gathered down to a detailed level. For example, a single line of code can be monitored if necessary.

PAPI comes with the following disadvantages:

- PAPI has a more significant impact on the application that is being monitored, which can potentially change the characteristics of the application itself.
- Changing what is to be measured almost always requires modifications to the source code, with subsequent recompilation.
- Data must be aggregated, parsed, viewed, and so on with outside tooling. It is up to the application itself to write the gathered data in the format it sees fit.

Blue Gene/L: Performance Analysis Tools

2.3 Summary comparison of Perfmon and PAPI


Table 2-1 summarizes and compares some of the similarities and differences of function between Perfmon and PAPI.
Table 2-1 Differences of function between Perfmon and PAPI

Data aggregation
- Perfmon: Data from all Compute Nodes used by the job is aggregated automatically.
- PAPI: The application must aggregate data from Compute Nodes (if desired).

Output of results
- Perfmon: Output is automatically written to a file system or relational database. Data is collected via the JTAG network and written to the data store with sockets (via the Midplane Management Control System (MMCS)). Tooling is provided to convert to CSV format.
- PAPI: The application must write output through the I/O Node to the file system. The format of the output is application specific.

Viewing results
- Perfmon: A simple GUI facility is provided to display detailed or aggregate results, with support to extract the desired results directly to CSV files. The GUI works both on data that has already been collected and on data that is currently being collected for running applications. The extract processing to CSV files allows for the normal query functions of select and project; however, no built-in subquery processing is supported through the extract tool. The extract tool can output the detailed data per node rank or aggregate data derived from the detailed data.
- PAPI: Because the format of the output is specific to the application that created it, no common method is available to view, parse, or format the results.

Hooks required
- Perfmon: No hooks are required in the code; the only requirement is that the static library is linked in with the user program.
- PAPI: Explicit hooks are required in the source code.

Portability
- Perfmon: Blue Gene only.
- PAPI: Portable across numerous supercomputing platforms.

Granularity
- Perfmon: Data is available both for specific nodes (ranks) and as aggregate data for all nodes. However, this data encompasses the entire running application; it is not possible with Perfmon to scope the collection to a specific code block.
- PAPI: Data gathered can be specific to a given node (rank), block of code, and so on. This makes it possible to scope the performance data collection to a specific area (or rank) of the application to better pinpoint a problem.

Impact on watched application
- Perfmon: Because data is gathered by an outside utility, there is little or no impact on the performance characteristics of the application being measured. The fact that the gathered performance data flows across the JTAG network also limits its impact on the application.
- PAPI: Because PAPI involves modifications to the actual application source code, it can more significantly change the runtime characteristics of the running application. In addition, the fact that performance data is written to the file system via the I/O Nodes can also impact the application.

Who uses/controls
- Perfmon: Perfmon is started, stopped, and configured by the system administrator, because it executes on the Service Node.
- PAPI: PAPI is controlled by the programmer.

Configurability
- Perfmon: Many configuration options can be specified upon startup of the Perfmon server. In addition, numerous environment variables can be set or changed to affect how performance data is gathered. Multiple servers can monitor data simultaneously, each with different configurations.
- PAPI: Because PAPI calls are embedded in the source code, little external configuration is possible, unless the application explicitly codes this type of functionality itself.

Chapter 2. Comparison of performance tools


Chapter 3. External Performance Instrumentation Facility


In this chapter, we take a detailed look at the External Performance Instrumentation Facility (EPIF), also known as Perfmon. We use both names interchangeably in this chapter.

Copyright IBM Corp. 2006. All rights reserved.


3.1 Overview of EPIF and Perfmon


Traditional approaches to collecting performance information for an application require that the application is instrumented by modifying the source code. Because the source code is altered, the instrumentation of the code affects the running of the application and potentially yields performance data that is not representative of the uninstrumented application. As Blue Gene/L positions itself to appeal to a broader range of applications, it is fundamental that performance data can be obtained in a manner that requires no application source code changes and does not measurably alter the runtime performance of those applications. A set of tools provides the foundation for the EPIF for Blue Gene/L.

Restriction: EPIF is dependent upon the use of the interval timer and establishes a SIGALRM handler. This can cause conflicts for some applications.

The previous version of EPIF (the Perl version, which was called perfmon.pl) was not removed from the V1R3 distribution. To use the previous version, applications must specify the environment variable BGL_PERFMON=1000. This variable identifies the previous default set of counters that are collected by perfmon.pl. Most, if not all, users should use the new Python version (perfmon.py), because it is the strategic direction of the tool set. However, perfmon.pl is still distributed for cases where a customer might still want to use it.

In addition, startperfmon is a script that still references the previous (Perl) version of EPIF. It can still be used to start the old perfmon.pl, but it is not intended to be used for the new (Python) version of EPIF. There are several reasons for this, but most are due to the fact that EPIF is now designed to be started and tailored for a specific partition or set of similarly sized partitions, specifying different sample intervals, sample types, and so on. Multiple instances of the new version of EPIF are possible and likely, while starting only a single server instance is unlikely. In addition, each potential instance of EPIF is installation dependent. The startperfmon script is still included in the V1R3 distribution, but we recommend that you do not use it.

Important: The previous (Perl) version of EPIF, in addition to the startperfmon script, will most likely be removed in a future release.
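The restriction noted above exists because EPIF arms the process interval timer and establishes its own SIGALRM handler, so an application that also manages SIGALRM can conflict with it. The following Python sketch illustrates only the general interval-timer mechanism (the timer values are arbitrary; this is not EPIF code), and requires a POSIX system:

```python
import signal
import time

ticks = []

def on_alarm(signum, frame):
    # Stand-in for the handler EPIF establishes; an application that
    # installs its own SIGALRM handler afterward would displace it.
    ticks.append(time.monotonic())

signal.signal(signal.SIGALRM, on_alarm)
signal.setitimer(signal.ITIMER_REAL, 0.05)  # arm the interval timer (one shot)
time.sleep(0.2)  # the timer expires during the sleep and the handler runs
print(len(ticks))
```

If the application later called signal.signal(signal.SIGALRM, ...) itself, the monitoring handler would be replaced, which is the class of conflict the restriction warns about.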

3.1.1 Objectives
The purpose of the EPIF is to provide the necessary support for the monitoring, collecting, and recording of performance information from Blue Gene/L Compute Nodes. The key objectives for this facility are listed here:
- This facility must be administered externally and be enabled in a way that allows all running applications to participate without requiring application source code changes.
- This facility must provide its function without measurably impacting the performance of any application running on Blue Gene/L.
- This facility must be able to associate the performance data collected from a given set of Compute Nodes to a specific instance of an application running in a partition that houses those Compute Nodes.


3.1.2 Goals and strategies


The EPIF is not intended to be an all-inclusive, comprehensive set of performance monitoring tools to do extensive analysis and presentation of performance-related data for Blue Gene/L. Rather, this facility provides:
- The means to collect performance data based on hardware counters without requiring modification to any application source code; at runtime, the application is unaware of the monitoring function
- An interactive interface to see results quickly and easily, with some fundamental distillation of data
- An interface to export or extract the performance data in different formats so that other tools not provided here can be used to do further analysis of that data

There are many aspects to performance monitoring other than hardware counters, but this set of tools deals with hardware counters exclusively. Other performance tool needs exist for Blue Gene/L and may or may not be incorporated into this set. Currently, six commands compose the set of tools for the EPIF:
- perfmon: Allows for the collection of performance data based upon hardware counters for jobs running on Blue Gene/L.
- dsp_perfmon: Provides a simple graphical user interface (GUI) to view performance data and do some high-level distillation of that data. This tool works on data collected from jobs that were previously monitored with EPIF or on data that is actively collected for currently running jobs that are monitored with EPIF.
- exp_perfmon_data: Allows already stored EPIF data to be exported from the performance database tables of the Midplane Management Control System (MMCS).
- imp_perfmon_data: Allows EPIF data to be imported into the performance database tables of MMCS.
- ext_perfmon_data: Allows for performance data to be extracted into flat files for further analysis by tools that are not provided with this tool set.
- end_perfmon: Allows for an instance of EPIF to be cleanly ended prior to the ending criteria being met that was specified on the perfmon command.

3.2 Basic concepts


EPIF is a Blue Gene/L performance tool that is based on the use of hardware counters to monitor particular runtime metrics. The number of floating point addition or floating point multiplication operations performed, L3 cache misses, and the number of XP packets sent from a Compute Node are examples of performance metrics that can be monitored using hardware counters.

The machine does not collect hardware counter information by default. You must first compile your application to link in this performance monitoring capability. When an application is compiled for Blue Gene/L, it can be linked with the performance counter library by adding -lbgl_perfctr.rts to the link step. This action is all that is necessary to allow EPIF to collect performance data for that application. No source code changes are required.

The actual increment of each hardware counter is performed by the user's Blue Gene/L job. However, the update of the hardware counters is done at such a low level of the machine that the performance implications to the application are negligible. The actual collection of the


counter values, and any processing of that data, are done outside of the user's Blue Gene/L job.

Counters are collected during sample intervals. Sample intervals are user-defined, fixed amounts of time. At the beginning of a sample interval, the machine records the counter values. The counters collected during a sample interval are referred to as a sample. Before the first sample is taken, the counters are initially collected and recorded. This initial collection of counters is referred to as the starting counter values. The starting counter values do not have to start at zero. All counter values for future samples are given as a delta value from these starting counter values.

When EPIF is started, a sample interval is specified. Sample intervals for all jobs occur at the same time. For example, if two jobs, A and B, are being monitored, sample 4 for Job A and sample 4 for Job B occur in the same window of time. Defining sample intervals in this manner allows for some level of performance analysis to occur across the various blocks within a Blue Gene/L system within a given time interval. See Figure 3-1 for a more complete set of examples of job-start and job-termination scenarios over a series of collection intervals.

Figure 3-1 Examples of data collection for various job and interval conditions
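The delta-value convention described above — every sample is reported relative to the starting counter values, which need not be zero — can be sketched as follows (names and numbers are illustrative only):

```python
def sample_deltas(starting, current):
    """Report each counter as a delta from the starting counter values.

    starting and current map counter IDs to raw hardware counter readings;
    the raw counters do not have to start at zero.
    """
    return {cid: current[cid] - starting[cid] for cid in starting}

# Starting counter values, recorded before sample 1 is taken.
start = {"FP_ADD": 1200, "L3_MISS": 40}
# Raw readings at the beginning of a later sample interval.
sample1 = {"FP_ADD": 1950, "L3_MISS": 55}
print(sample_deltas(start, sample1))  # {'FP_ADD': 750, 'L3_MISS': 15}
```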

By default, EPIF monitors all Blue Gene/L jobs. However, EPIF can be started with a filter so that only jobs with particular attributes are monitored. As Blue Gene/L jobs end, monitoring for those jobs automatically ends. As new Blue Gene/L jobs are initiated, if those jobs meet the specified filtering criteria, monitoring for each of those new jobs automatically


starts. If those new jobs are running applications that have been compiled to update hardware counters, those counters are collected and recorded by EPIF.

Given the way that sample intervals are defined, when EPIF is first started, all of the jobs running at that time have their first sample recorded during the first sample interval. The starting counter values are collected during sample interval 0, and the first sample taken is sample 1. All Blue Gene/L jobs that start at a later time have their first sample recorded during a future sample interval, and that first sample is not sample 1, but rather the number of that sample interval. For example, the first sample taken for Job C might be sample 35: the job started during sample interval 34, when the starting counters were recorded, and the first sample is then recorded during sample interval 35.

When EPIF is started, a sample type is defined. The sample type can be either detailed or summary. Detailed samples record and save the counter values for each sample interval, for each monitored job. Summary samples record and save only the counters for the last sample interval for each monitored job. The processing performed at counter collection time is the same for both sample types. The only difference is whether new values are saved for each sample interval or whether the previously saved counter values for a given job are overwritten with values from the latest sample interval. Summary samples save storage space, but you lose the ability to do time analysis of the performance data. Detailed samples allow you to do time analysis of the performance data by comparing the counter values as recorded and saved during each of the sample intervals for a running job. By default, summary samples are collected.

EPIF can collect up to 52 performance counters for each Blue Gene/L job. Each job can map the 52 locations to different counter definitions so that different metrics can be recorded for each job.
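The detailed-versus-summary behavior described above — save every snapshot, or overwrite the previously saved values with the latest ones — can be sketched as a tiny recorder; the class and field names are hypothetical, not EPIF internals:

```python
class SampleStore:
    """Detailed samples keep every snapshot; a summary sample keeps
    only the counters from the latest sample interval."""

    def __init__(self, sample_type="summary"):
        self.sample_type = sample_type  # summary is the default
        self.samples = []

    def record(self, counters):
        if self.sample_type == "detailed":
            self.samples.append(counters)  # keep the full history
        else:
            self.samples = [counters]      # overwrite the previous values

detailed = SampleStore("detailed")
summary = SampleStore()  # summary by default
for snap in ({"FP_ADD": 10}, {"FP_ADD": 25}, {"FP_ADD": 40}):
    detailed.record(snap)
    summary.record(snap)
print(len(detailed.samples), len(summary.samples))  # 3 1
```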
Specific counter definitions are stored in the MMCS performance database table BGLPERFDESC. Another table within that database, BGLPERFDEF, is used to logically group a set of counters together to be collected for a Blue Gene/L job. A set of performance counters is identified by a counter definition ID that is given by the DEFINITION_ID column. The COUNTER_ID column can be used to join to the same named column in the BGLPERFDESC table to determine which counter definitions are defined for a given counter definition ID.

When a Blue Gene/L job is initiated, a specific counter definition ID is bound to that job. The ID is determined from the environment variable named BGL_PERFMON. If BGL_PERFMON is not defined when a job starts, the Blue Gene/L job uses a counter definition ID of 1004. Counter definition ID 1004 is the default set of counters for EPIF to collect. See Appendix C, Perfmon database table specifications on page 81, to view the list of all possible counter definitions and the list of supported counter definition IDs.

Storage is consumed within the external file system for the performance data collected by EPIF. The fixed storage cost required to save control information for a newly monitored job is approximately 72 KB per midplane. The storage required to save the 26 counters for counter definition ID 1004 is about 190 KB to 340 KB per midplane.

More than a single instance of EPIF can be run on the system at a time. Each instance is run with its own set of monitoring attributes. For example, one instance of EPIF can monitor jobs in small partitions and collect detailed samples with a sample interval of 15 seconds. At the same time, a second instance of EPIF can monitor jobs in large partitions and collect summary samples with a sample interval of one minute.
Criteria that can be used to indicate which jobs are to be monitored by an instance of EPIF include a list of user names, list of block IDs, minimum block size, maximum block size, and any other attribute found in the BGLJOB table. The block ID specification also allows for regular expressions, so that running jobs can be compared against a block ID pattern.
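The binding of a counter definition ID at job initiation, with 1004 as the default when BGL_PERFMON is unset, can be sketched as follows (the function name is ours, not part of the tool set):

```python
import os

DEFAULT_COUNTER_DEFINITION_ID = 1004  # default set of counters for EPIF

def counter_definition_id(environ=os.environ):
    """Return the counter definition ID bound to a job at initiation.

    If BGL_PERFMON is not defined when the job starts, counter
    definition ID 1004 is used.
    """
    return int(environ.get("BGL_PERFMON", DEFAULT_COUNTER_DEFINITION_ID))

print(counter_definition_id({}))                       # 1004 (no variable set)
print(counter_definition_id({"BGL_PERFMON": "1000"}))  # 1000 (Perl-version set)
```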


While it might not normally make sense to monitor the same Blue Gene/L job from more than one EPIF instance, it is possible to do so. The two collections do not interfere with one another. Each EPIF instance stores its collected data in a different set of files.

The output from EPIF is always stored to the external file system to be accessed later by the other commands within the External Performance Instrumentation Facility suite. Each instance of EPIF always creates a new directory and stores all files that pertain to that collection in that directory. The location where that directory will be created can be specified when EPIF is started.

Note: While it is possible to monitor a given job using more than one instance of EPIF, all of those instances of EPIF use the same counter definition ID for that job. The counter definition ID that is used for a Blue Gene/L job is bound to the job, not to a running instance of EPIF.

In addition to storing the performance data for an instance of EPIF to the external file system, the EPIF collection facility can also optionally store the data in the MMCS performance database. While the data is written to the external file system synchronously as it is collected, the data is written asynchronously to the MMCS performance database. If the performance data is not written to the MMCS performance database as a function of the EPIF processing, the data can always be imported into the database at a later time with the imp_perfmon_data command, which is also found within this tool suite. See Figure 3-2 for an overview of the data collection and processing performed by EPIF.

Figure 3-2 Overview of EPIF data collection and processing


3.3 EPIF commands


In this section, we provide more details about the EPIF commands that are used to gather and manipulate external performance gathering on Blue Gene/L.

3.3.1 perfmon
The perfmon command starts an instance of the performance monitor tool. Many options can be specified to control the collection of performance data based upon the hardware counters. The remainder of this section gives a brief overview of the processing performed by EPIF and how the various command line options can be used to control the running of EPIF. See 3.5.1, Options for EPIF on page 44, for additional information.

When EPIF starts, the first action performed is to determine the configuration for that instance of EPIF. Defaults are defined for all options, so no options are required for an invocation of EPIF. However, it is quite common to pass one or more options to customize the running of EPIF. Options are first taken from the command line, then from any honored environment variables, and finally from the defaults as established by the perfmon command.

Note: In the documentation that follows, only the command line options are called out. Many of the command line options have corresponding environment variables. See the EPIF help text for more information about the specific environment variables that are honored by EPIF. The online help text can be accessed by using the following command:
perfmon.py -help
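The option precedence described earlier — command line first, then any honored environment variables, then the built-in defaults — can be sketched as follows; the EPIF_-prefixed environment variable naming is a made-up placeholder, not the actual variable names:

```python
def resolve_option(name, cli, env, defaults):
    """Options are taken first from the command line, then from any
    honored environment variable, and finally from the defaults."""
    if name in cli:
        return cli[name]
    env_name = "EPIF_" + name.upper()  # hypothetical env-var naming
    if env_name in env:
        return env[env_name]
    return defaults[name]

defaults = {"sample_interval": 60}
print(resolve_option("sample_interval", {}, {}, defaults))  # 60
print(resolve_option("sample_interval", {},
                     {"EPIF_SAMPLE_INTERVAL": "30"}, defaults))  # 30
print(resolve_option("sample_interval", {"sample_interval": 15},
                     {"EPIF_SAMPLE_INTERVAL": "30"}, defaults))  # 15
```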

EPIF first validates the combination of options to be used for this instance of the command. If there are any conflicts or errors, warning or error messages are issued to the console file. Serious problems result in error messages that are sent to the console and to stderr. Error messages that are sent due to an incorrect configuration end the running of EPIF before any processing related to the collection of performance counters begins.

All pertinent information regarding the processing being performed by EPIF is sent to the console file, or simply the console. The output that is sent to the console can be redirected using standard Linux command line redirection or with the --console option. The console option has a default path of /bgl/BlueLight/logs/Blue Gene/L system-name/EPIF. Various levels of filtering are available for the information sent to the console file. The filtering of information is set using the --verbose option, with the default value being 3. A verbose level of 3 echoes back the final configuration, messages for each job when monitoring is started and ended for it, and a summary after each sample interval. If desired, higher levels of verbosity give additional detailed flows from EPIF to its administrative threads. Error messages are always sent regardless of the verbosity level.

Many aspects of the configuration control how performance counters are to be collected. One aspect is which Blue Gene/L jobs will be monitored. Five options can be specified to control which jobs will be monitored by EPIF. A Blue Gene/L job is only monitored if it meets all of the criteria established by all five options. The --block_id and --username options can be used to specify that only jobs running in a particular block or running under a particular user are to be monitored. Individual values or lists of values can be specified for each of these options. Regular expressions are supported for the list of block IDs.

Additionally, the --min_block_size and --max_block_size options can be used to target jobs running in partitions that fall within a particular size range. The units for both of these two


options are in midplanes, with 0 as the smallest min_block_size value allowed and 1 as the smallest max_block_size value allowed.

In addition to those four options, the --sql option can be used to provide any additional predicates to the SQL statement that is used to query the MMCS database for active Blue Gene/L jobs. The value specified on the --sql option must start with a logical operator and is appended to the predicates that are already being used by the perfmon command processing to find Blue Gene/L jobs. Reference the attributes in the BGLJOB table to see the additional criteria that can be used to control the jobs that are monitored by an instance of EPIF. If none of the five options are specified, the default is to monitor all jobs running in all partitions.

Another aspect of the configuration is how often EPIF collects the performance counters for the Blue Gene/L jobs being monitored. This is called the sample interval and is specified with the --sample_interval option. The default sample interval is determined by the size of the largest potential block that can have a job monitored by this instance of EPIF. Basically, three seconds are allowed for every midplane in the largest possible block, rounded up to a 10-second boundary. Therefore, the smallest possible sample interval is 10 seconds. The maximum allowed sample interval value is 1 hour.

When a Blue Gene/L job is found by EPIF that is to be monitored, a single administrative thread is first spawned to collect additional information regarding that job. Various attributes of the job, the block in which the job is running, and the information related to the Blue Gene/L personality for the job are collected and recorded. The final piece of information that this administrative thread collects is the counter definition ID that the job is using. If the performance counter library was not linked into the application for the job, then there is no counter definition ID for the job.
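The default sample interval rule above (three seconds per midplane in the largest possible block, rounded up to a 10-second boundary, with a 10-second floor and a 1-hour ceiling) can be sketched as follows; the exact rounding is our reading of the text:

```python
import math

def default_sample_interval(midplanes):
    """Three seconds per midplane, rounded up to a 10-second boundary,
    bounded below by 10 seconds and above by 1 hour (3600 seconds)."""
    raw = 3 * midplanes
    rounded = math.ceil(raw / 10) * 10
    return min(3600, max(10, rounded))

print(default_sample_interval(1))     # 10   (3 s rounds up to 10 s)
print(default_sample_interval(8))     # 30   (24 s rounds up to 30 s)
print(default_sample_interval(2000))  # 3600 (capped at 1 hour)
```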
If the counter definition ID cannot be determined, EPIF stops monitoring for that job, without collecting any samples. Monitoring for other jobs is not affected. If the counter definition ID can be determined after the application starts to run, EPIF spawns additional administrative threads to help in the collection of the counter data during each ensuing sample interval. The first set of counter data collected is known as the starting counter values. Counter values for all ensuing sample intervals are given as delta values from these starting counter values. After the starting set of counter values is collected for a monitored job, EPIF collects the performance counters for that job at the beginning of each sample interval until the job ends or the monitor is ended. A snapshot of the counters is collected during each sample interval regardless of whether detailed samples or a summary sample is being collected. Whether detailed samples or a summary sample is collected is another configuration option.

Detailed samples refers to the concept that all of the snapshots of counter data are saved for future analysis. A summary sample means that only the last of the snapshots of counter values is saved. Detailed samples consume more hard disk drive space, but allow for more detailed, time-based analysis. Summary samples consume less hard disk drive space, but do not allow for any analysis of the intermediate collected samples. The sample type to be collected by an instance of EPIF is defined by the --sample_type option, and the default is to collect summary samples.

After a configuration is validated by early EPIF processing, if indicated, EPIF delays the start of the collection process. The default is to start the collection process immediately, but the --start_delay option can be used to delay the collection for a period of time, or the --start_time option can be used to indicate that the collection process is to start at a specific date or time.

Similarly, three mutually exclusive options dictate how long EPIF will continue to monitor jobs. The --run_time option allows for an amount of time to be specified. After that amount of time


has expired, the monitoring process ends. The --samples option allows for a specific number of samples to be collected before the monitoring process is ended. Finally, the --end_time option allows for a specific date and time to be specified, at which time monitoring ends. The time used is based on the clock of the Service Node.

If it is determined that an instance of EPIF should be ended prior to the time specified on the perfmon command, the end_perfmon command can be issued to end the monitor normally. Using the end_perfmon command ends the EPIF processing in a controlled fashion. The Linux kill command can also be used to end the EPIF process. Any previously collected data for that instance of the monitor is still preserved; however, any data for the sample interval in progress is lost.

After the general monitoring process starts, existing running Blue Gene/L jobs are inspected to see if they should be included in the initial set of jobs to monitor. If one or more jobs are found, the initial information is collected from each of those jobs, a set of threads is spawned for each of those jobs, and the starting counters are then collected from each of those jobs. Each subsequent sample interval takes a snapshot of the counters until either the job ends or the monitoring process ends.

As EPIF continues to run, new jobs that meet the monitoring criteria are also sought out. Two methods are used to find new jobs. At the beginning of each sample interval, new jobs are automatically found by querying the BGLJOB table. If any new jobs are found that meet the monitoring criteria, those jobs are added to the set of jobs that are monitored, initial information about each of those newly found jobs is collected, and a set of threads is spawned for each of those new jobs. A starting set of counter values is collected, and a snapshot of the counter values is taken on every subsequent sample interval until the job ends or the monitoring process is ended.
In addition to finding new jobs at the beginning of each sample interval, a user can optionally indicate that the EPIF process is to start a daemon that does nothing but poll the BGLJOB table for new jobs. This polling is done based upon a time interval as dictated by the --poll_new_jobs option. By default, this option is set to three seconds, so that EPIF polls for new jobs every three seconds. If a new job is found by this daemon, the main EPIF process is sent a message, and the previously described processing starts for that newly found job. If enough time exists within the current sample interval, the starting counters are immediately collected for that newly found job. Otherwise, the starting counters for the newly found job are collected at the beginning of the next sample interval. A --poll_new_jobs option value of 0 indicates that a daemon is not to be used to find new jobs.

All of the information collected by EPIF and its associated threads is saved to the file system. All of the data for a given instance of EPIF is stored in a single directory. The user has control over the location of this directory. The --monitor_path option indicates the path to use when EPIF creates the directory to store all of the data for an instance of EPIF. If the path does not exist, directories for the path are created by EPIF. The user has no control over the name of the directory used to house the data files for a given instance of EPIF, which is created under the directory specified with the --monitor_path option. The name of the directory has a fixed name portion and a timestamp portion to clearly identify the Service Node name and the time that this instance of EPIF was started. The default monitor path value is /bgl/BlueLight/EPIF/.

At the end of every sample interval, all of the necessary information that is collected by EPIF is always written to the file system. However, it is also possible to have EPIF asynchronously write the results to the MMCS performance database.
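The new-job detection that both methods perform — comparing active jobs in the BGLJOB table against the jobs already being monitored, subject to the filtering criteria — can be sketched as follows (names and the sample filter are hypothetical; the real processing queries the MMCS database):

```python
def find_new_jobs(active_job_ids, monitored_job_ids, matches_filter):
    """Return jobs that are active, meet the monitoring criteria,
    and are not yet being monitored."""
    return sorted(j for j in active_job_ids
                  if j not in monitored_job_ids and matches_filter(j))

monitored = {101, 102}
active = {101, 102, 103, 104}
# Hypothetical filter standing in for block/username/size criteria:
# only even-numbered job IDs meet the criteria.
print(find_new_jobs(active, monitored, lambda j: j % 2 == 0))  # [104]
```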
The --import_to_database option indicates if this additional processing is to be performed. If the results are to be written to the database, an additional daemon is spawned by EPIF to handle this processing.


At the beginning of every sample interval, if the previous sample interval generated data that must be imported into the database, a request is made to the daemon, and the results are asynchronously imported into the database by the daemon. This import facility can lag behind the actual collection of performance data being written to the file system. However, if there are outstanding import requests when the monitoring process is to end, the ending of EPIF is delayed until all of the import requests have completed. The values for the --import_to_database option are True and False, with the default being False.

Note: The data for a given instance of EPIF can always be imported into the MMCS performance database at a later time. The imp_perfmon_data command allows for such an import to be initiated by the user.

If for any reason a set of counter values cannot be collected for one or more Compute Nodes for a given sample interval, no additional attempts are made during subsequent sample intervals to collect those same counter values. Monitoring continues for that job, but does not contain the counter values for those affected nodes. Appropriate messages are sent to the console for this situation. If for any reason one or more of the administrative threads ends abnormally, the remainder of the administrative threads for that monitored job are ended, and monitoring for that job is ended. Again, appropriate messages are sent to the console. In either case, monitoring for other Blue Gene/L jobs is not affected.

Advanced options for the perfmon command


Three additional advanced options can control other timing aspects regarding how the counters are collected. It is possible that certain system environmental factors can effect the processing to be performed during a particular sample interval. Examples can be a workload spike or excessive network traffic. It might be the case that all of the necessary work that must be performed for a particular sample interval cannot be completed within the time allowed by the sample interval. These three options can be specified to effect how EPIF performs given these timeout situations. The first of these options is --sample_interval_extensions, which indicates the number of additional units of sample interval time that will be allowed for a given sample interval without any response from the administrative threads. EPIF uses multiple threads to monitor the various Blue Gene/L jobs. For example, if the sample interval is 30 seconds and --sample_interval_extensions is set to 3, a total of 2 minutes is allowed for any of the administrative threads to respond to any of the outstanding requests. If none of the threads respond within that total amount of time, EPIF ends. The options --thread_timeout and --thread_retries can be specified to affect the timeout and retry tolerances for these threads. It is usually not necessary to specify values for any of these three advanced options because the EPIF program chooses appropriate values given the runtime environment. But, if desired, the user can control these timing and timeout values. Additional advanced options can be specified to help define the threads that are spawned to monitor the various Blue Gene/L jobs. Each time a Blue Gene/L job is found that is to be monitored by EPIF, Perfmon spawns a set of threads to collect the performance counters for that particular job. Many options can be specified to control how these threads are managed by EPIF. 
It is not necessary to specify values for any of these four advanced thread options because the EPIF program chooses appropriate values based upon the runtime environment, but they are highlighted here to show that it is possible to control certain aspects of the EPIF administrative threads. The --max_threads_per_system option sets an upper limit on the total number of administrative threads spawned across the entire Blue Gene/L system to collect performance counters. If not specified, the value 256 is used as the default upper limit.
Blue Gene/L: Performance Analysis Tools

Two additional, mutually exclusive options control the number of threads that are spawned for a given monitored job. The --max_threads_per_job option sets an upper limit on the number of threads that are spawned for a monitored job, and the --nodes_per_thread option gives the number of Compute Nodes that are monitored by a single administrative thread. By default, the EPIF program chooses the number of threads spawned for a given Blue Gene/L job based upon the number of Compute Nodes to be monitored in the partition. In general, the larger the number of Compute Nodes, the larger the number of spawned threads.

Each time a collection is to be performed by an administrative thread, the thread opens a socket, issues JTAG requests, and then closes the socket. The --max_concurrent_threads option controls the number of threads, across the entire Blue Gene/L system, that can concurrently have an open socket issuing EPIF-related JTAG requests. In general, the default number of concurrent threads is 16.
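How the two mutually exclusive per-job options bound the thread count can be sketched as follows. The selection formula and the fallback default are assumptions for illustration, not EPIF's actual policy:

```python
import math

def threads_for_job(num_nodes, nodes_per_thread=None, max_threads_per_job=None):
    """Pick a thread count for one monitored job.

    Either one thread per group of nodes_per_thread Compute Nodes, or a
    simple cap on the thread count. The default of 8 threads is an
    arbitrary placeholder, not the value EPIF would choose.
    """
    default = 8
    if nodes_per_thread is not None:
        return math.ceil(num_nodes / nodes_per_thread)
    if max_threads_per_job is not None:
        return min(default, max_threads_per_job)
    return default

print(threads_for_job(512, nodes_per_thread=32))  # one thread per 32 nodes
```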

3.3.2 dsp_perfmon
The dsp_perfmon command displays the results of the data saved to the file system by EPIF. This command works for all EPIF instances, including those that are still actively collecting data. The raw counter data can be viewed, but many other items related to the data can also be viewed or derived. In this section, we briefly describe the data that is available from using the dsp_perfmon command.

While the performance monitor data must be collected on the Service Node, the resulting data can be replicated to any platform that supports Python and the data analysis done there. The application windows shown in this document are from performance monitor files that were replicated to a mobile computer running Microsoft Windows XP, where the ensuing analysis was performed. It could just as easily have been done running Linux on a Front End Node. If you are viewing live data as it is being collected on the Service Node, that analysis is normally done on a Front End Node, unless you are replicating the performance monitor data in real time.

The dsp_perfmon command opens a window to view the data that is collected, or currently being collected, by EPIF. Depending on the platform, it is invoked using a command similar to the following example:
python dsp_perfmon.py

The perfmon command creates a main control file with a .mon extension. When opening a performance monitor file, this main control file should be opened. You can open it from the dsp_perfmon command by selecting File → Open from the menu bar. See Figure 3-3.

Chapter 3. External Performance Instrumentation Facility


Figure 3-3 Opening the main perfmon file

After you open the performance monitor file, the Display Performance Monitor Data window changes to display the files as shown in Figure 3-4.

Figure 3-4 Main panel after opening the main perfmon file


Notice that the name of the main performance monitor file begins with the Service Node name followed by the database name. A time stamp can also be found in the name. This time stamp has the format:
yyyy-mmdd-hhmmss_xx

Here xx is a fractional part of the seconds portion of the time stamp. The time stamp represents when the perfmon command was issued on the Service Node.

After you open the main performance monitor file, two or more items are displayed that can be expanded. The first item, when expanded, shows attributes related to the display environment. The second item gives detailed information about the EPIF runtime environment. Each item after those first two is a one-liner that exists for each monitored Blue Gene/L job. Expanding any of those one-liners gives information particular to that monitored job. Only one performance monitor file can be opened at a time within a given dsp_perfmon display window, but you can have more than one instance of the command running to view and compare results between two or more performance monitor files.

The main performance monitor file is essentially a file with control information. Other files, with different file extensions within the same directory as the main performance monitor file, contain detailed information for each of the monitored Blue Gene/L jobs and the counters that have been collected for each of those jobs. All of the information necessary to do analysis using the dsp_perfmon command is contained within the files in that directory. Any necessary information from the MMCS database has been extracted into these files, and the MMCS database is not required during any of the analysis. This allows the dsp_perfmon analysis to be done on a different machine or platform.

The files that contain information for each of the monitored Blue Gene/L jobs and the files that contain the counter information can be numerous and contain a lot of information. To prevent you from being overwhelmed by the amount of data, you can filter the results that are displayed to help you focus on certain data elements. The jobs to be displayed can be filtered by block ID, job ID, and user ID.
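Parsing such a file-name time stamp can be sketched as follows; the sample stamp and the assumption that xx is hundredths of a second are illustrative, and the helper is not part of EPIF:

```python
from datetime import datetime

def parse_perfmon_stamp(stamp):
    """Split a yyyy-mmdd-hhmmss_xx time stamp into a datetime plus
    the fractional-second part xx (assumed here to be hundredths)."""
    base, frac = stamp.rsplit("_", 1)
    return datetime.strptime(base, "%Y-%m%d-%H%M%S"), int(frac)

dt, frac = parse_perfmon_stamp("2006-0715-120000_50")
print(dt.isoformat(), frac)
```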
An option is also provided that shows only Blue Gene/L jobs with non-zero counter values, such as jobs running applications that were compiled to update the hardware performance counters. All of the filtering criteria work together and can be specified in any combination. You can find all of these filtering options under the Filter menu item (Figure 3-5).

Figure 3-5 Filter menu items

Under the Format menu option (Figure 3-6) are options that are related to how samples and nodes are displayed. These attributes allow a user to alter how samples or nodes are grouped together on the display. Because many samples and nodes can be contained within the performance monitor data for a single job, all of this data does not expand immediately into lines of detailed data. Sample data and node data are grouped together. You can also drill down easily with a few mouse clicks to specific detail from a given sample for a given node.

Figure 3-6 Format menu items

Consistent data is always displayed, whether it comes from a non-active performance monitor file or from a monitor file that is actively collecting performance data. All of the data that is displayed, extracted, or drilled down into can be thought of as a snapshot of the performance monitor data at the end of a sample interval. The only time that new data is introduced is when a new monitor file is opened or the Window → Refresh option is selected from the menu (Figure 3-7).

Figure 3-7 Window menu items

Data descriptions using the dsp_perfmon command


The following sections further describe the data that can be displayed and derived using the dsp_perfmon command.

Display Performance Monitor Attributes


If you expand this item, you see the following information:

Filters in Effect (see Figure 3-8)
- Allows only jobs that have block IDs matching those in the block ID filter list to be displayed
- Allows only jobs that have job IDs matching those in the job ID filter list to be displayed
- Allows only jobs that have user IDs matching those in the user ID filter list to be displayed
- Shows only those jobs with non-zero counter values; provides a method to filter out all Blue Gene/L jobs that are running applications not compiled to update hardware counters


Note: When switching between different performance monitor files, the filters are reset and not preserved.

Figure 3-8 Display performance monitor attributes: Filters in Effect
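The combined filter behavior described above can be sketched as follows; the job record's field names are illustrative assumptions, not EPIF's internal representation:

```python
def job_passes_filters(job, block_ids=None, job_ids=None, user_ids=None,
                       nonzero_only=False):
    """All specified filters must match (they work together in any
    combination); an unset filter passes every job."""
    if block_ids is not None and job["block_id"] not in block_ids:
        return False
    if job_ids is not None and job["job_id"] not in job_ids:
        return False
    if user_ids is not None and job["user_id"] not in user_ids:
        return False
    if nonzero_only and not any(job["counters"]):
        return False
    return True

job = {"block_id": "R000", "job_id": 42, "user_id": "alice", "counters": [0, 7]}
print(job_passes_filters(job, block_ids=["R000"], nonzero_only=True))
```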

EPIF Runtime Capture Attributes


If you expand this item, you see the following information:

- All basic runtime data relevant to the collection, including start/stop time data
- Administrative thread information; these threads are spawned by EPIF to perform the collection of the counter data

Figure 3-9 EPIF runtime capture attributes: Administrative Thread Information


- All configuration items used for the collection; most of these items can be specified on the perfmon command line or with environment variables that EPIF recognizes. See 3.5.1, Options for EPIF on page 44, for more information.

Figure 3-10 EPIF runtime capture attributes: Configuration

Job specific information


The rest of the first panel has one line for each Blue Gene/L job that was, or is currently being, monitored. The lines give information to help identify each monitored job. Figure 3-11 lists the information that is available under each of the monitored jobs:

- Number of Completed Samples: The number of samples that have successfully been completed for the job
- Counters Collected During Original Sampling: The sample number of the starting, first, and last samples collected
- Alternate Starting Sample Number: If desired, you can right-click this line to change the starting sample number used for all counter calculations, including any histogram data
- Counter ID Filter: If desired, you can right-click this line to filter the counters that are to be displayed. By default, all counters collected for the job are displayed.
- Number of Histogram Bins: If desired, you can right-click this line to set the number of bins to be used when building histograms from the data for this job. By default, 20 histogram bins are used.
- Calculate Stats/Histograms with Normalized Values: If desired, you can right-click this line to toggle whether descriptive statistics and histograms are calculated with raw counter values or values normalized to time. By default, normalized values are used.

Figure 3-11 Job specific information

Job Attributes (Figure 3-12) This line lists the attributes for the Blue Gene/L job.

Figure 3-12 Job specific information: Job Attributes


Block Attributes (Figure 3-13) This line lists the attributes for the block used by the job.

Figure 3-13 Job specific information: Block Attributes

EPIF Attributes for Job (Figure 3-14) This line lists the specific EPIF runtime attributes for the job, including:

- Number of Administrative Threads: The number of administrative threads used to monitor the job
- Counter Definition ID: The counter definition ID used to monitor the job and, if expanded, all of the individual counter definitions for that counter definition ID
- EPIF File Names: The file names used for the performance data collected for this job
- EPIF Start/Stop Times: The time stamps when monitoring was started and ended, the starting and ending time stamps of the last sample, and the elapsed monitoring time for the job


Figure 3-14 EPIF attributes for job

Explore Data Using a Samples/Nodes/Counters Hierarchy


Drilling down allows you to see counter values for a given sample across all nodes for that sample. See Figure 3-15.

- Starting Counters: Shows the values for the starting counters
- Per Sample Data:
  - Starting time stamp of this sample for this job
  - Ending time stamp of this sample for this job
  - Descriptive Statistics per Counter Index: Statistics pertaining to this sample, for all nodes and counters. These statistics are always calculated using the counter delta values, which are the differences between the counter values for the sample and the starting counter values.
    - Basic (Min, Max, Range, Median, Mean, Mode, Cardinality, and so on); see Figure 3-15


Figure 3-15 Per sample data

Histogram, Range=Number of Nodes, with counter values in increasing order

Figure 3-16 Histogram: Range equals number of nodes


Histogram, Range=Counter values, with counter values in increasing order (Figure 3-17)

Figure 3-17 Histogram: Range equals counter values

Under each of the histogram bins, information is given about the nodes within that bin. See Figure 3-18.

Figure 3-18 Histogram: Nodes within a bin


Per node data: Individual counter values, with delta values calculated from the starting counter values (Figure 3-19).

The data for each individual counter includes the raw hex value, the delta value as calculated from the starting counter value, and the value normalized to a rate per second. The hardware counters are updated by the machine approximately every six seconds. Even though counter values are collected using multiple threads, it is possible that the raw counter values collected for a given sample for different nodes come from different six-second machine intervals. This normalized value makes it possible to accurately compare counter values from different nodes within the same sample interval.

Figure 3-19 Histogram: Per node data
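The normalization described above amounts to a delta-over-elapsed-time rate, which can be sketched as follows; the helper and its inputs are illustrative, not EPIF code:

```python
def normalized_rate(raw_hex, start_hex, elapsed_s):
    """Delta from the starting counter value, normalized to a
    per-second rate so nodes sampled at slightly different machine
    intervals can be compared fairly."""
    delta = int(raw_hex, 16) - int(start_hex, 16)
    return delta / elapsed_s

# 0x1F4 - 0x64 = 500 - 100 = 400 events over 40 s -> 10.0 events/s
print(normalized_rate("0x1F4", "0x64", 40.0))
```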


Traverse to adjacent nodes from the current node (Figure 3-20). Using the view in Figure 3-19, we traversed from the node with rank 1, which has a torus value of (0,0,0), in the X+ direction to the node with rank 3, which has a torus value of (1,0,0). Then we traversed in the Y+ direction to the node with rank 120, with a torus value of (4,2,3). Finally, we traversed in the X- direction to the node with rank 4, with a torus value of (1,1,0). The counter values at any of the nodes along the way can be displayed.

Figure 3-20 Histogram: Traversing to adjacent nodes from the current node


Right-click functionality is available within the data for a job. In the Alternate Starting Sample Number line (A), right-click to define an alternate starting sample number. In the Counter ID Filter line (B), right-click to specify a list to filter the counter descriptions to be shown.


Figure 3-21 Job data

You see the Filter by Counter Id window (see Figure 3-22).

Figure 3-22 Filter by counter ID


- In the Number of Histogram Bins line (C), right-click to specify the number of bins to use for histograms built from the data for this job.
- In the Starting Counters section (D), for any given sample, right-click to extract the starting counter values for that sample.
- In the heading for a given sample number (E), right-click to extract the counters and all histograms for that sample. By default, the descriptive statistics are not extracted.
- In the Descriptive Statistics line (F), right-click to extract the descriptive statistics and all histograms for that sample. By default, the counter values are not extracted.
- In a particular Counter Index (G) for a particular sample number, right-click to extract the histogram for just that counter, for that sample. By default, the counter values and descriptive statistics are not extracted.

When you right-click the Alternate Starting Sample Number line (panel location A), you can define an alternative starting sample number. It is desirable to do this when you want all of the descriptive statistics and histogram data to be calculated using a different starting sample instead of the actual starting sample that was collected. This alternative starting sample number affects the information displayed and all ensuing extracts performed for that Blue Gene/L job. See Figure 3-23.

Figure 3-23 Setting the alternative starting sample number


When you right-click the Counter ID Filter line, you can specify a list of numeric counter description values. Only values for those counter descriptions are calculated and shown. A similar window opens for you to change the number of histogram bins and whether statistics and histograms are to be calculated using normalized values. See Figure 3-24.

Figure 3-24 Changed the alternative starting sample number


After you right-click any of the panel locations from D through G (see Figure 3-21 on page 34), the Extract Perform Data window opens so that any of the options to the ext_perfmon_data command can be overridden. See 3.5.2, Options for ext_perfmon_data on page 49, for more information.

Figure 3-25 Extracting EPIF data window

Make any changes to the options, and click OK to perform the extract function.


You see a final message (Figure 3-26) that indicates whether the extract is a success or if any problems were encountered during the extract.

Figure 3-26 Data extraction completion message

Likewise, the console from which the dsp_perfmon command was entered also gives information regarding each extract operation that is performed. See Figure 3-27.

Figure 3-27 Data extraction messages in console


Explore Data Using a Nodes/Samples/Counters Hierarchy


Drilling down, you can see counter values for a given node across all samples for that node. See Figure 3-28.

- Starting Counters: Shows the values for the starting counters
- Per Node Data: Descriptive statistics pertaining to this node, for all samples/counters
  - Basic (Min, Max, Range, Median, Mean, Mode, Cardinality, and so on)

Figure 3-28 Per node data: Basic

- Per sample data: Individual counter values, with delta values calculated from the starting counter values

3.3.3 ext_perfmon_data
The ext_perfmon_data command extracts the data from a sample for one or more monitored Blue Gene/L jobs. By default, the data is extracted into one or more comma-separated value (CSV) files. However, you can use any character, or set of characters, as the separator. The primary intent of generating data in this fashion is that the results can be easily opened and processed with a spreadsheet application, such as Microsoft Excel, and other types of data analysis applications. The following list explains the kind of data that can be extracted or derived.


Counter values

By default, counter values are extracted (see Figure 3-29). Each row represents the data values for a given node. The values extracted are normalized to time. Each row contains the following columns:

- Node rank
- Torus address X value
- Torus address Y value
- Torus address Z value
- Each counter value being extracted
- Delta CPU Cycles
- Elapsed Time
- Node location information

Figure 3-29 Counter values
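Such a row might be produced as in the following sketch; the column order follows the list above, and the header names and values are illustrative assumptions rather than ext_perfmon_data output:

```python
import csv
import io

# Assumed column layout, following the list above (one counter shown).
header = ["Node Rank", "Torus X", "Torus Y", "Torus Z",
          "Counter 0", "Delta CPU Cycles", "Elapsed Time", "Location"]
row = [1, 0, 0, 0, 1234.5, 980000000, 60.0, "R00-M0-N0"]

buf = io.StringIO()
writer = csv.writer(buf)          # the default separator is a comma
writer.writerows([header, row])
print(buf.getvalue())
```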


Descriptive statistics

By default, descriptive statistics are not extracted. When requested, descriptive statistics are extracted into the same file as the counter values. Each row represents a particular statistic, with the columns being the same as for the counter values above. The columns for node rank, torus address values, delta CPU cycles, elapsed time, and node location are left blank. See Figure 3-30. The descriptive statistics that are provided by this option are:

- Minimum: Minimum value for the counter across all nodes
- Minimum Node: Node rank with the minimum value
- Maximum: Maximum value for the counter across all nodes
- Maximum Node: Node rank with the maximum value
- Range: Range of values for the counter across all nodes
- Median: Median value of the counter across all nodes
- Median Node: Node rank with the median value
- Mean: Mean value of the counter across all nodes
- Mode: Mode value of the counter across all nodes
- Mode Occurrences: Number of occurrences of the mode value
- Cardinality: Cardinality of the counter across all nodes
- Variance: Variance of the counter across all nodes
- Absolute Deviation: Absolute deviation of the counter across all nodes
- Coefficient of Absolute Deviation: Coefficient of absolute deviation
- Standard Deviation: Standard deviation of the counter across all nodes
- Coefficient of Standard Deviation: Coefficient of standard deviation

Figure 3-30 Descriptive statistics
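Most of these statistics can be reproduced with Python's standard library, as in this sketch; EPIF's exact definitions (for example, its coefficient calculations, or whether variance is population or sample based) are assumptions here:

```python
import statistics

values = [4.0, 8.0, 6.0, 5.0, 3.0, 8.0]  # one counter, one value per node

minimum, maximum = min(values), max(values)
stats = {
    "Minimum": minimum,
    "Maximum": maximum,
    "Range": maximum - minimum,
    "Median": statistics.median(values),
    "Mean": statistics.mean(values),
    "Mode": statistics.mode(values),
    "Cardinality": len(set(values)),          # number of distinct values
    "Variance": statistics.pvariance(values),
    "Standard Deviation": statistics.pstdev(values),
}
# Absolute deviation taken as the mean of |x - mean| (assumed definition)
mean = stats["Mean"]
stats["Absolute Deviation"] = sum(abs(v - mean) for v in values) / len(values)
print(stats)
```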


Histogram

By default, histograms for each counter are not produced. When requested, a histogram is generated into a separate file for each counter that is being extracted. The range for the histogram is the range of the values for the counter. The number of histogram bins can be specified, with the default being 20. See Figure 3-31. Each row represents the data for a particular histogram bin and contains the following columns:

- Bin label: The minimum and maximum values for the bin
- Number of Values: The number of nodes falling in the bin range
- Node ranks: The ranks of the nodes falling in the bin range
- Torus values: The (X,Y,Z) torus address values for the nodes falling in the bin range
- Location information: The location information for the nodes falling in the bin range

Figure 3-31 Histogram
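The binning itself can be sketched as follows, with equal-width bins spanning the counter's value range; the helper is illustrative rather than EPIF code:

```python
def histogram_bins(values_by_rank, num_bins=20):
    """Group node ranks into equal-width bins spanning the value range."""
    lo, hi = min(values_by_rank.values()), max(values_by_rank.values())
    width = (hi - lo) / num_bins or 1       # guard against a zero-width range
    bins = [[] for _ in range(num_bins)]
    for rank, value in values_by_rank.items():
        # The maximum value lands exactly on the upper edge; clamp it
        # into the last bin.
        index = min(int((value - lo) / width), num_bins - 1)
        bins[index].append(rank)
    return bins

bins = histogram_bins({0: 0.0, 1: 5.0, 2: 9.9, 3: 10.0}, num_bins=2)
print(bins)
```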

Selection is also allowed as part of the extract operation. Basic Boolean predicates are supported to further filter the data to be extracted. Predicates can be specified with operands being either the node rank; the torus address X, Y, and Z values; or counter values within the EPIF data. An operand of %R is used for node rank; %X, %Y, and %Z for torus address values of X, Y, and Z respectively; and %Cxx for counter values, where xx is the ordinal position of the desired counter index within the counter definition being used. Most basic query capabilities are supported except for anything that requires subquery processing, which includes any correlation between values from different nodes. All basic logical expressions are supported, including ==, <, >, >=, <=, <>, !=, and in (list of values). Nested levels of and, or, and not are supported with multiple levels of parentheses. Numeric expressions are supported as an operand, and comparison between counter values for a given node can be specified as a predicate. The counter definition ID used for the extract is the same definition ID used when the counters were collected.

By default, counters for the last sample for all jobs found in the performance monitor data are extracted. You can further filter the extract by specifying a list of job IDs. If a list of job IDs is specified, the list must be enclosed with parentheses and the individual job ID values separated by commas. In addition, the number of columns extracted (counters defined by the counter definition ID) can be filtered by specifying a list of columns to extract. The column values represent the ordinal position within the counter definition for the desired counter indices. If a list of columns is specified, the list must be enclosed with parentheses and the individual column values separated by commas. All of the filtering options mentioned in this section affect the counters to be extracted, the descriptive statistics, and the histograms to be generated. See the help text under the command for more detailed information.
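One way such predicates could be evaluated per node is to substitute the % operands and evaluate the resulting expression, as in this sketch; it assumes trusted input and is not EPIF's actual parser:

```python
import re

def eval_predicate(pred, rank, torus, counters):
    """Evaluate a %R/%X/%Y/%Z/%Cxx predicate against one node's values."""
    expr = pred.replace("<>", "!=")  # map the SQL-style inequality operator
    # %Cxx -> the counter at ordinal position xx
    expr = re.sub(r"%C(\d+)", lambda m: str(counters[int(m.group(1))]), expr)
    expr = (expr.replace("%R", str(rank))
                .replace("%X", str(torus[0]))
                .replace("%Y", str(torus[1]))
                .replace("%Z", str(torus[2])))
    return bool(eval(expr))  # trusted input only; a real parser would validate

print(eval_predicate("%R in (1, 3) and %C0 > 100", 1, (0, 0, 0), [250]))
```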

3.3.4 exp_perfmon_data
The exp_perfmon_data command exports the EPIF data from the MMCS performance database for a particular INSTANCE_ID and exports the results to a flat file. See the help text under the command for more detailed information.

3.3.5 imp_perfmon_data
The imp_perfmon_data command takes the data collected by the perfmon command and imports it into the MMCS performance database tables. By themselves, the EPIF tools have no dependencies on any of the EPIF data residing in these relational tables. The primary reason to do this is if you need to process the EPIF data from more than one instance of EPIF using SQL queries. See the help text under the command for more detailed information.

3.3.6 end_perfmon
The end_perfmon command ends an instance of EPIF cleanly before the ending criteria specified on the perfmon command is met. See the help text under the command for more detailed information.

3.4 Building the necessary Python packages


You can find an installation of Python in /usr/bin on the Service Node. The following packages should exist in the /usr/lib/python/site-packages directory. If not, you must build them as explained for each of the respective packages:

Candygram: You install this package into a directory named candygram, which is created by the installation process under the installed Python's site-packages directory.

a. Download the Candygram-1.0.tar.gz file from the following Web address:
http://sourceforge.net/projects/candygram/

b. Run the following commands when prompted in the order shown:


tar -zxvf Candygram-1.0.tar.gz
cd Candygram-1.0
python setup.py install


PyDB2: The installation places three DB2-related files into the installed Python's site-packages directory.

a. Download the PyDB2-1.1.0-2.tar.gz file from the following Web address:
http://sourceforge.net/projects/pydb2/

b. Run the following commands when prompted in the order shown:


tar -zxvf PyDB2-1.1.0-2.tar.gz
cd PyDB2-1.1.0-2
python setup.py install

mx: You install this package into a directory named mx, which is created by the installation process under the installed Python's site-packages directory.

a. Download the egenix-mx-base-2.0.6.tar.gz file from the following Web address:
http://www.egenix.com/files/python/mxDateTime.html

b. Follow the download instructions. c. Run the following commands when prompted in the order shown:
tar -zxvf egenix-mx-base-2.0.6.tar.gz
cd egenix-mx-base-2.0.6
python setup.py install

This is all you need on the Service Node to run everything for EPIF except dsp_perfmon, which you do not want to run from the Service Node anyway. You need wxPython to run dsp_perfmon. If you want to use dsp_perfmon from a Front End Node, you need to install wxPython there. Use the following Web address to obtain the code and for detailed installation instructions:
http://www.wxpython.org/download.php#sources

If you are not running dsp_perfmon from a Front End Node, you must copy some of the Python scripts to the system from which you plan to run dsp_perfmon. Specifically, you must copy:

- All of the .py modules that start with EPIF_ from /bglsys/bin
- dsp_perfmon.py
- ext_perfmon_data.py
- imp_perfmon_data.py

3.5 Typical command uses


This section shows some of the more common ways in which EPIF commands are used. The idea is to give you a starting point for various operations from which you can head in a direction better suited to your particular needs.

3.5.1 Options for EPIF


The following example shows a typical invocation of EPIF. This example assumes that the environment variables DB_PROPERTY, PERFMON_CONSOLE, and PERFMON_MONITOR_PATH are properly set. The options for this command set a sample interval of 60 seconds and an ending time for the performance monitor. By default, summary samples are collected, and by default, all 52 counter samples are collected for all Blue Gene/L jobs.

python perfmon.py --sample_interval=60 --end_time=12:00:00

The console file is a flight recorder for the EPIF processing. Figure 3-32 shows an excerpt of a console file with a default verbosity level (4).

Figure 3-32 Typical console output for EPIF

The current set of basic options available for EPIF is explained in the following list. To see all of the possible options for the perfmon command, see the actual code prolog.

-h, --help: This option prints help information and exits.

--[no]verbose: This option determines how much information is sent to the console. --noverbose is the same as verbose=0. --verbose is the same as verbose=1.


The various levels of verbosity are:

0: Starting, final summary, and final messages; basic error messages
1: Same as 0 plus a summary message for each sample
2: Same as 1 plus a summary message for each job that is being monitored
3: Same as 2 plus an echo of the final perfmon configuration and snippet information about the starting and delta counters for each job being monitored
4: Same as 3 plus an echo of the messages sent to the threads monitoring the various jobs; help and usage information that is sent to the console includes the advanced options
5: Same as 4 plus detailed timings for read_mem(s) and the parsing of counter data
6: Same as 5 plus additional dump information for internal objects
7: Same as 6 plus additional information about the success or failure of each message or request sent to the threads monitoring the various jobs

The default is the environment variable PERFMON_VERBOSE and then 3.

--block_id='str': Monitoring is only performed for jobs that are running in the specified block IDs. This must identify a single block ID or a list of block IDs. If a list is specified, the block ID values must be separated with commas and specified as "(b1,b2,b3)", with the double quotation marks. Regular expressions are supported as a block ID name. The default is to monitor jobs from all block IDs.

--console='str': The console is written to this path location. The default is the environment variable PERFMON_CONSOLE and then /bgl/BlueLight/logs/EPIF/.

--dbproperties='str': This option allows the selection of a different db.properties file. The default is the environment variable DB_PROPERTY and then ./db.properties.

--end_time='str': The last sample starts no later than this specified time. This option is specified in a date and time format. This option is mutually exclusive with samples and run_time.
If specified, the value must fall within one calendar month from the current time and must be at least one sample interval of time (--sample_interval option) later than the starting time for the monitor.

--import_to_database='str': This option indicates whether the performance monitor results are to be asynchronously imported into the MMCS performance database. This option does not alter any other processing other than to issue multiple import requests for the performance data that is being generated. The values are True and False. The default is the environment variable PERFMON_IMPORT_TO_DATABASE and then False.


--max_block_size='str': This option specifies the maximum block size from which jobs are considered for monitoring. The units for this option are midplanes. The default is the environment variable PERFMON_MAXIMUM_BLOCK_SIZE and then the number of midplanes that are currently active on the system, which essentially includes all Blue Gene/L jobs on the system.

--min_block_size='str': This option specifies the minimum block size from which jobs are considered for monitoring. The units for this option are midplanes. The default is the environment variable PERFMON_MINIMUM_BLOCK_SIZE and then 0. Zero essentially includes all Blue Gene/L jobs on the system.

--monitor_path='str': The perfmon control object and starting counters are saved at this path location. The default is the environment variable PERFMON_MONITOR_PATH and then /bgl/BlueLight/EPIF/.

--poll_new_jobs='str': This option is the time interval used to poll for new jobs. It is specified in a date and time format. This option is mutually exclusive with poll_int. A value of 0 indicates that no polling should be performed for new jobs; instead, new jobs are only discovered at the beginning of each sample interval. If specified, the value must be less than or equal to 15 minutes. The default is the environment variable PERFMON_POLL_NEW_JOBS and then 3 seconds.

--run_time='str': The last sample starts no later than after this specified amount of time. This option is specified in a date and time format. This option is mutually exclusive with samples and end_time. If specified, the value must fall within one calendar month from the current time and must be at least one sample interval of time (--sample_interval option) later than the starting time for the monitor. The default is the environment variable PERFMON_RUN_TIME and then 24:00:00.

--sample_interval='str': This option indicates the time interval to be used between samples. It is specified in a date and time format.
The value specified must be less than or equal to 1 hour. This value must be specified and is mutually exclusive with job_int. The environment variable PERFMON_SAMPLE_INTERVAL can be used to specify this value. The default is 0, which means that the system chooses a value based upon the maximum size block that can be monitored. --samples=int This option indicates the number of samples to collect. This option is mutually exclusive with end_time and run_time. If specified, the value must be positive and less than or equal to 100,000.

Chapter 3. External Performance Instrumentation Facility


--sample_type='str'
This option indicates whether the individual sample data for each monitored job is to be saved or if an overall summary of that sampled data is to be saved. If D or d is specified as the first character, then detailed samples are collected. Otherwise, a summary sample is collected. The default is environment variable PERFMON_SAMPLE_TYPE and then summary.

--start_delay='str'
This option indicates the amount of time to delay before starting to monitor for running jobs. It is specified in a date and time format. This option is mutually exclusive with start_time. If specified, the value must fall within one calendar month from the current time.

--start_time='str'
This option indicates the time to start monitoring jobs. Any time within a calendar month from the current time is allowed. This option is mutually exclusive with start_delay. The default is an immediate start of perfmon.

--sql='str'
This SQL option is appended to the WHERE clause of the SQL query that is used to determine which jobs should be considered for monitoring purposes. If only jobs with certain attributes should be monitored, this option allows these additional predicates to be specified. When specifying this option, you must provide a leading logical operator. This option works in conjunction with the blockid and username options. Any values specified for the blockid and username options are ANDed together, and then the clause specified in the sql option is appended last. The default is for no additional selection predicates to be used.

--user='str'
This option specifies the user to drop to if perfmon is currently running as root. The default is None.

--username='str'
Monitoring is only performed for jobs that are running under the specified user names. This must identify a single user name or a list of user names. If a list is specified, the user name values must be separated with commas and specified as "(u1,u2,u3)", including the double quotation marks. If specified, each value must be a valid user name that is currently defined to the system. The default is to monitor jobs for all user names.

Some of the options for the perfmon command are mutually exclusive. See the help text for details pertaining to these options and the possible values for these options. Other environment variables are also supported.
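The way the blockid, username, and sql options combine into one WHERE clause can be pictured with a short sketch. This is an illustration only, not perfmon's actual internals; the function name and column names (BLOCKID, USERNAME, STATUS) are assumptions:

```python
def build_where_clause(blockid=None, usernames=None, sql=None):
    """Combine job-selection options the way perfmon describes:
    blockid and username predicates are ANDed together, then the
    user-supplied --sql fragment (which must carry its own leading
    logical operator) is appended last."""
    predicates = []
    if blockid:
        predicates.append("BLOCKID = '%s'" % blockid)
    if usernames:
        quoted = ", ".join("'%s'" % u for u in usernames)
        predicates.append("USERNAME IN (%s)" % quoted)
    clause = " AND ".join(predicates)
    if sql:  # e.g. --sql="AND STATUS = 'R'" (leading operator required)
        clause = (clause + " " + sql).strip()
    return clause

example = build_where_clause(blockid="R000", usernames=["u1", "u2"],
                             sql="AND STATUS = 'R'")
```

Running this with the values above yields the combined predicate string with the --sql fragment appended last, as the option descriptions specify.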


Note: Getting at the online documentation for the Python commands is relatively easy. You can use -h to see the help text for a Python command from the command line, but a better way is from a browser. From a command line, enter the following command:
pydoc -p 8888

In your browser, in the address field, type:


http://localhost:8888/

You can now click any of the Python files listed to see the online documentation. The files are listed by directory, as installed on your system, with hot links between the various files. The files that are listed can be Python packages, modules, or files.

3.5.2 Options for ext_perfmon_data


After you start EPIF, you can extract samples at any time. The following example shows extracting all of the counters from the last sample for all Blue Gene/L jobs found in a performance monitor file. A new file is generated for each of the jobs found.
python ext_perfmon_data.py --perfmon_dir=/bglscratch/perfmon/tmp/bgl1gb-bgdb0-perfmon-2006-0420-160422.62/ --extract_to=/bglscratch/perfmon/tmp/

This command extracts data from the performance monitor located at /bglscratch/perfmon/tmp/bgl1gb-bgdb0-perfmon-2006-0420-160422.62/ and extracts the results to /bglscratch/perfmon/tmp/. By default, counter values are extracted, but no statistics are generated at the end of the file that contains the counters. Histograms are not generated. All columns from the counter definition and counter values from all nodes are extracted. The following example is of the extract file name that is generated by the command for one of the Blue Gene/L jobs that is found. Notice how the performance monitor time stamp, job ID, starting and ending sample numbers, and the counter definition ID are embedded into the extract file name.
bgl1gb-bgdb0-perfmon-2006-0420-160422.62_J_54324_S_15_E_30_C_1000.csv
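Because the job ID, sample range, and counter definition ID are embedded between fixed _J_/_S_/_E_/_C_ markers, a generated extract file name can be picked apart programmatically. The following regular expression is an illustration based on the naming pattern shown above, not part of the EPIF tooling itself:

```python
import re

# Pattern for names like:
#   <monitor time stamp>_J_<jobid>_S_<start>_E_<end>_C_<defid>.csv
EXTRACT_NAME = re.compile(
    r"(?P<monitor>.+)_J_(?P<job>\d+)_S_(?P<start>\d+)_E_(?P<end>\d+)"
    r"_C_(?P<defid>\d+)\.csv$")

def parse_extract_name(name):
    """Return the fields embedded in a generated extract file name,
    or None if the name does not follow the pattern."""
    m = EXTRACT_NAME.match(name)
    if m is None:
        return None
    d = m.groupdict()
    # Everything except the monitor time stamp is an integer
    return {k: (v if k == "monitor" else int(v)) for k, v in d.items()}

fields = parse_extract_name(
    "bgl1gb-bgdb0-perfmon-2006-0420-160422.62_J_54324_S_15_E_30_C_1000.csv")
```

For the example file name above, this recovers job ID 54324, samples 15 through 30, and counter definition ID 1000.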

The following example is a more complicated extract example:


python ext_perfmon_data.py --perfmon_dir=/bglscratch/perfmon/tmp/bgl1gb-bgdb0-perfmon-2006-0420-160422.62/ --extract_to=/bglscratch/perfmon/tmp/ --stats=true --histogram=true --columns=(3,5,7) --histogram_bins=100 --selection="(%X = 4 and %Z = 5) or %C2 > 2.5E9"

This command extracts data from the performance monitor located in /bglscratch/perfmon/tmp/bgl1gb-bgdb0-perfmon-2006-0420-160422.62/ and extracts the results to /bglscratch/perfmon/tmp/. Counter values are extracted, and descriptive statistics are generated at the end of the file that contains the counters. Histograms are generated into additional files for the counters that are being extracted, and 100 histogram bins are used. Only columns 3, 5, and 7 are extracted from the counter definition ID. Also, counter values are only included in the extract when a possible contributing node has a torus address of (X = 4 and Z = 5), or the second column (second counter in the counter definition ID) is greater than 2.5 x 10^9. The following example is of the histogram file name generated by the command for one of the histograms that is generated for one of the counters, for one of the Blue Gene/L jobs found. Notice how the performance monitor time stamp, job ID, starting and ending sample numbers,


counter definition ID, counter name, and the number of histogram bins are embedded into the histogram file name.
bgl1gb-bgdb0-perfmon-2006-0420-160422.62_J_54324_S_15_E_30_C_1000_N_BGL_UPC_TS_XP_PKTS_B_20.csv

Example 3-1 shows how this command can be used to create a single output file from multiple extract operations. The objective of the processing that follows is to extract all of the counter values for a node with a torus address of (1,2,3) and all of its adjacent nodes. The --extract_to file does not exist when the first extract command executes.
Example 3-1 Using ext_perfmon_data.py to create a single output file from multiple extract operations

python ext_perfmon_data.py --perfmon_dir=/bglscratch/perfmon/tmp/bgl1gb-bgdb0-perfmon-2006-0420-160422.62/ --extract_to=/bglscratch/perfmon/tmp/xTemp.csv --job_id=18960 --selection="(%X == 1 and %Y == 2 and %Z == 3)" --append=true
python ext_perfmon_data.py --perfmon_dir=/bglscratch/perfmon/tmp/bgl1gb-bgdb0-perfmon-2006-0420-160422.62/ --extract_to=/bglscratch/perfmon/tmp/xTemp.csv --job_id=18960 --selection="(%X == 0 and %Y == 2 and %Z == 3)" --append=true
python ext_perfmon_data.py --perfmon_dir=/bglscratch/perfmon/tmp/bgl1gb-bgdb0-perfmon-2006-0420-160422.62/ --extract_to=/bglscratch/perfmon/tmp/xTemp.csv --job_id=18960 --selection="(%X == 2 and %Y == 2 and %Z == 3)" --append=true
python ext_perfmon_data.py --perfmon_dir=/bglscratch/perfmon/tmp/bgl1gb-bgdb0-perfmon-2006-0420-160422.62/ --extract_to=/bglscratch/perfmon/tmp/xTemp.csv --job_id=18960 --selection="(%X == 1 and %Y == 1 and %Z == 3)" --append=true
python ext_perfmon_data.py --perfmon_dir=/bglscratch/perfmon/tmp/bgl1gb-bgdb0-perfmon-2006-0420-160422.62/ --extract_to=/bglscratch/perfmon/tmp/xTemp.csv --job_id=18960 --selection="(%X == 1 and %Y == 3 and %Z == 3)" --append=true
python ext_perfmon_data.py --perfmon_dir=/bglscratch/perfmon/tmp/bgl1gb-bgdb0-perfmon-2006-0420-160422.62/ --extract_to=/bglscratch/perfmon/tmp/xTemp.csv --job_id=18960 --selection="(%X == 1 and %Y == 2 and %Z == 2)" --append=true
python ext_perfmon_data.py --perfmon_dir=/bglscratch/perfmon/tmp/bgl1gb-bgdb0-perfmon-2006-0420-160422.62/ --extract_to=/bglscratch/perfmon/tmp/xTemp.csv --job_id=18960 --selection="(%X == 1 and %Y == 2 and %Z == 4)" --append=true

Figure 3-33 shows the output written to the /bglscratch/perfmon/tmp/xTemp.csv file.

Figure 3-33 CSV output from ext_perfmon_data
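The seven commands in Example 3-1 differ only in their --selection torus coordinates, so the selection strings lend themselves to being generated in a small loop. This sketch mirrors Example 3-1 and ignores torus wraparound at the machine edges (a node at X=0 would need special handling on a real torus):

```python
# Build the --selection strings for a node at (x,y,z) and its six
# face-adjacent torus neighbors, as used in Example 3-1.
def neighbor_selections(x, y, z):
    coords = [(x, y, z)]
    # One step in each direction along each of the three torus dimensions
    for dim, delta in ((0, -1), (0, 1), (1, -1), (1, 1), (2, -1), (2, 1)):
        c = [x, y, z]
        c[dim] += delta
        coords.append(tuple(c))
    return ['"(%%X == %d and %%Y == %d and %%Z == %d)"' % c for c in coords]

selections = neighbor_selections(1, 2, 3)
```

Each returned string, including the surrounding double quotation marks, can be substituted into a single ext_perfmon_data.py command template instead of typing seven nearly identical commands by hand.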


The following list indicates the current set of basic options available for ext_perfmon_data:

--monitor_path='str'
The perfmon control object, the counter files, and all other files related to the perfmon run are found at this location. This option must be specified.

-h, --help
This option prints help information and exits.

--[no]verbose
This option determines how much information is sent to the console. --noverbose is the same as verbose=0. --verbose is the same as verbose=1. The various levels of verbosity are:
0: A starting and final message, plus basic error messages
1: The same as 0, plus a summary message for each job
2: The same as 1, plus a summary message for each histogram generated
Help and usage information sent to the console includes the advanced options. The default is verbose=1.

--append='str'
This option indicates whether extracted counter or stats data is appended to the --extract_to file or the file is cleared first. It only applies if a specific file name is given in the --extract_to option, and it never applies to an output file that is created by this command for histogram purposes. If an --extract_to file is appended to, the column headings are not output. Otherwise, column headings are always output before any counter or stats data. This assures that column headings always exist once, and only once, as the first row in an --extract_to file. The values are true and false, with the default being false.

--counters='str'
This option indicates whether the individual counter values for each node are to be extracted. The values are true and false, with the default being true.

--columns='str'
Data is only extracted for the columns that are listed. The column values are with respect to the counter definition in effect. The valid values for this option are:
- A single column value
- A list of column values, (COLUMN_1, COLUMN_3, COLUMN_7, ...)
- "all", which extracts all columns
- "none", which extracts no columns for the counter definition in effect; this effectively extracts only the node definitions, with no counter data
In the file that contains the extracted counters and stats, the columns are in the order specified on this option. Otherwise, the columns are always extracted in the order given in the counter definition ID. The default is all.

--extract_to='str'
Counters are extracted to this file. If only a path is specified, then a file name is generated from the perfmon control file, the job ID, the sample number, and the counter definition in effect for the extract operation. If only a directory is to be specified, then this value must end with a forward slash (/). If a path and file name, or simply a file name, is specified, then data for only a single job can be extracted. The file name is used as is for the counters and stats portion of the extract. The file name, appended with information about the counter name and the number of histogram bins, is used for any histogram files that are generated. When the system generates the file name for a file that contains the counter and stats information, the generated file name has the following format:

'dirname for monitor file'_J_'jobid'_S_'starting sample #'_E_'ending sample #'_C_'defid #'.csv

For system-generated histogram file names, the generated file name has the following format:

'dirname for monitor file'_J_'jobid'_S_'starting sample #'_E_'ending sample #'_C_'defid #'_N_'counter name'_B_'# histogram bins'.csv

The default is environment variable PERFMON_EXTRACT_TO and then the current working directory. If the file for the counters and stats information already exists and --append=true is not specified, the file is overwritten. If the name that is generated by the command for a histogram file collides with a file that already exists on the system, that file is overwritten.

--histogram='str'
This option indicates whether a histogram with a range of the counter values in increasing order is to be generated for each of the counters and each of the jobs being extracted. A separate file is generated for each counter value, for each job. Histogram file names are similar to the counter/stats file names that are generated by the system, except that the counter name and the number of histogram bins used are also appended to the name. The values are true and false, with the default being false.

--histogram_bins=int
This option indicates the number of histogram bins to use. It only applies if the --histogram option is true. The value must be greater than zero and less than or equal to 32000. The default value is 20.

--job_id='str'
This option indicates the job ID or IDs that have sample data extracted. Each job is extracted to an individual file. The valid values for this option are:
- A single job ID value
- A list of job IDs, (JOB_ID1, JOB_ID2, JOB_ID3, ...)
- "all", which extracts data for all job IDs in the performance monitor
The default is all.

--normalize='str'
This option indicates whether the extracted counter values should be normalized to time. The values are true and false. The default is true.

--sample=int
This option refers to the sample for the job ID that will have counters extracted. A value of 0 indicates that the starting counters are to be extracted. A value of -1 indicates that the counters for the last collected sample for the job ID will be extracted. Otherwise, this number must identify a valid sample number for the job ID within the perfmon data.

This option only applies if the counters option is true. In all cases, the counters extracted for the sample are always given as a delta from the starting counters. Therefore, the starting counters are always extracted as zeroes. If summary samples were used during the original perfmon collection (sample_type=summary), then this value must be defaulted or specified as -1 for the extract to work. The default value is -1.

--selection='str'
Basic Boolean predicates can be specified with this option to further filter the data to be extracted. Predicates can be specified for the node rank, the torus X, Y, and Z values, and for values of counters within the perfmon data. Specify %R for selection against node rank; %X, %Y, and %Z respectively for selection against the torus X, Y, and Z values; and %Cxx, where xx is the counter within the counter definition in effect, to provide selection for a given counter. All basic logical operators are supported, including ==, <, >, >=, <=, <>, !=, and in (list of values). Nested levels of 'and', 'or', and 'not' are supported with multiple levels of parentheses. All basic numerical operators are supported for arithmetic expressions, including +, -, *, /, **, and %. When specifying the predicates, separate each operand and operator with at least one space. However, no blank space is required following an open parenthesis, nor is one required before a closed parenthesis. Most basic query capabilities are supported except for anything that requires subquery processing, which includes any correlation between values from different nodes. The default is no additional predicates.

--separator='str'
This option indicates the character or characters that are to be used as separators for the generated file. The default is a single comma (,).

--starting_sample_number=int
For the purposes of descriptive statistics and histograms, this option gives the sample number for the starting counters. A value of 0 indicates that the starting counters as originally collected are to be used as the starting counters for the extract operation. Otherwise, this number must identify a valid sample number for the job ID within the perfmon data. This option only applies if either the histogram or stats option is true, and detailed samples were originally collected. The default value is 0.

--stats='bool'
This option indicates whether descriptive statistics should be extracted for the identified sample. The statistics are extracted immediately following any extract of counter values in that same file. The values are true and false, with the default being true.

--user='str'
This option specifies the user to drop to if the command is currently running as root. The default is None.

See the help text or the online documentation for details pertaining to these options and the possible values for these options.
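To make the --selection grammar concrete, the following sketch evaluates such a predicate for a single node. The real ext_perfmon_data parser is certainly more careful than this; the sketch simply rewrites the %-tokens into Python names and evaluates the result, which happens to work because the documented operators largely overlap with Python's:

```python
import re

def matches(selection, rank, x, y, z, counters):
    """Return True if a node with the given rank, torus coordinates,
    and counter values satisfies a --selection-style predicate."""
    env = {"R": rank, "X": x, "Y": y, "Z": z}
    # %C1, %C2, ... refer to counters within the counter definition
    for i, value in enumerate(counters, start=1):
        env["C%d" % i] = value
    expr = re.sub(r"%(\w+)", r"\1", selection)   # %X -> X, %C2 -> C2
    expr = expr.replace("<>", "!=")              # SQL-style inequality
    return bool(eval(expr, {"__builtins__": {}}, env))

# The predicate from the earlier example: torus (X=4, Z=5), or a large
# second counter value
hit = matches("(%X == 4 and %Z == 5) or %C2 > 2.5E9",
              rank=0, x=4, y=0, z=5, counters=[0, 0])
```

This mirrors how a predicate such as "(%X = 4 and %Z = 5) or %C2 > 2.5E9" selects a node: either by torus position or by a counter threshold, with no correlation across nodes.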

3.5.3 dsp_perfmon
After you start EPIF, you can enter the following command to display the results at any time:
python dsp_perfmon.py

You cannot specify a performance monitor file to open on the invocation of the command. Instead, select File → Open from the menu to open the desired performance monitor control file. That file has an extension of .mon. If environment variable PERFMON_MONITOR_PATH is defined, then the open dialog starts at that path location.

3.5.4 exp_perfmon_data
See the help text or the online documentation for more information regarding this command.

3.5.5 imp_perfmon_data
See the help text or the online documentation for more information regarding this command.


Chapter 4. Performance Application Programming Interface


This chapter provides details about the Performance Application Programming Interface (PAPI) support that is provided on Blue Gene/L. For more information about PAPI in general, refer to the following Web address:
http://icl.cs.utk.edu/papi/index.html

Copyright IBM Corp. 2006. All rights reserved.


4.1 PAPI implementation


The PAPI implementation on Blue Gene/L is based on the 2.3.4 release that was available at the time when the work was initiated. The PAPI library consists of two parts: the common library API and a substrate interface. The substrate interface (often called the substrate) contains all the platform-specific code in a PAPI implementation, while the main code is identical among all platform implementations. This particular port of PAPI to the Blue Gene/L Compute Node Kernel conforms to this with a few minor modifications as detailed in 4.1.3, Modifications to PAPI on page 58.

4.1.1 The linux-bgl PAPI substrate


The PAPI substrate for the Blue Gene/L Compute Node Kernel is located in a subdirectory of the PAPI distribution named linux-bgl. The substrate is built on top of the bgl_perfctr application programming interface (API) and uses this API for all hardware counter manipulation. The substrate enables a fully functional PAPI v2 library, including overlapping counters. Due to lack of operating system support and the nature of the intended use of the Blue Gene/L machine, the PAPI_overflow() function is unimplemented, and a call to this function returns PAPI_ESBSTR according to the library convention. There is no notion of virtual CPU time in the Compute Node Kernel. For this reason, both PAPI_get_real_cyc() and PAPI_get_virt_cyc() are mapped to the CPU time base register. For the same reason, PAPI_get_real_usec() and PAPI_get_virt_usec() report the same amount of elapsed time.
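Because both the real and virtual cycle calls read the same time base register, converting a cycle delta into wall-clock time is a single division. A minimal sketch, assuming the time base advances at the standard 700 MHz Blue Gene/L processor clock (an assumption worth confirming against your system's personality data):

```python
# Convert time-base cycle deltas (as returned by PAPI_get_real_cyc)
# into elapsed wall-clock time.
CLOCK_HZ = 700e6   # assumed Blue Gene/L core/time-base frequency

def cycles_to_usec(start_cyc, end_cyc):
    """Elapsed microseconds between two time-base readings."""
    return (end_cyc - start_cyc) / CLOCK_HZ * 1e6

elapsed = cycles_to_usec(0, 700_000)   # 700,000 cycles at 700 MHz
```

At 700 MHz, 700,000 cycles correspond to one millisecond, which is the kind of sanity check that PAPI_get_real_usec() reporting should agree with on this platform.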

4.1.2 PAPI event mapping for Blue Gene/L


The Blue Gene/L substrate for PAPI includes a default mapping of standard PAPI events to available counters in the Blue Gene/L hardware counter infrastructure. Due to the nature of the application-specific integrated circuit (ASIC) design of Blue Gene/L, many events available on commodity machines are not available on this platform. This typically includes events that are only detectable inside the PPC cores of the ASIC. Examples of such events are L1 cache events, branch prediction events, and instruction counts. The ASIC design of Blue Gene/L makes available to the user a complete new set of events that relate to states in the network controllers on the chip. Through the PAPI native event mechanism, any event that is available in the universal performance counter (UPC) or floating point unit (FPU) counters can be programmed and controlled through PAPI. A native event is handled in the same way as the PAPI predefined events and passed through the same API calls. The difference is that, instead of passing a PAPI predefined event name, a bit pattern that corresponds to the event code and, where applicable, an edge detection mask are used. This is shown in Example 4-1.


Example 4-1 PAPI native event format for Blue Gene/L

#include "papi.h"
#include "bgl_perfctr.h"

int eventFPU, eventUPC;

/* Code initializing PAPI not shown here . . . */

/* Encode a BG/L native event for PAPI */
eventFPU = BGL_2NDFPU_TRINARY_OP & 0x3FF;
eventUPC = (BGL_UPC_L3_PAGE_OPEN & 0x3FF) | (BGL_PERFCTR_UPC_EDGE_RISE << 10);

retval = PAPI_add_event(&evset, eventFPU);
retval = PAPI_add_event(&evset, eventUPC);

To simplify the usage of some of the communication-related events and to encourage the usage of these counters, the standard PAPI event mapping has been expanded with several new presets designed for Blue Gene/L. Example 4-2 shows the full set of new events.
Example 4-2 New PAPI nonstandard predefined events on Blue Gene/L

PAPI_BGL_OED - Oedipus operations
  The event is a convenience name for:
  {BGL_FPU_ARITH_OEDIPUS_OP,0,0}

PAPI_BGL_TS_32B - Torus 32B chunks sent
  The event is the sum of the following 6 events:
  {BGL_UPC_TS_XM_32B_CHUNKS,BGL_PERFCTR_UPC_EDGE_RISE,0}
  and similarly for _XP_, _YM_, _YP_, _ZM_ and _ZP_

PAPI_BGL_TS_FULL - Torus no token UPC cycles
  The event is the sum of the following 6 events:
  {BGL_UPC_TS_XM_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS,BGL_PERFCTR_UPC_EDGE_HI,0}
  and similarly for _XP_, _YM_, _YP_, _ZM_ and _ZP_

PAPI_BGL_TR_DPKT - Tree 256 byte packets
  The event is the sum of the following 6 events:
  {BGL_UPC_TR_SNDR_0_VC0_DPKTS_SENT,BGL_PERFCTR_UPC_EDGE_RISE,0},
  {BGL_UPC_TR_SNDR_0_VC1_DPKTS_SENT,BGL_PERFCTR_UPC_EDGE_RISE,0},
  and similarly for SNDR_1_ and SNDR_2_

PAPI_BGL_TR_FULL - UPC cycles (CLOCKx2) tree rcv is full
  The event is the sum of the following 6 events:
  {BGL_UPC_TR_RCV_0_VC0_FULL,BGL_PERFCTR_UPC_EDGE_HI,0},
  {BGL_UPC_TR_RCV_0_VC1_FULL,BGL_PERFCTR_UPC_EDGE_HI,0},
  and similarly for RCV_1_ and RCV_2_

The communication events are designed to provide easy aggregated counts of the traffic that occurs at each node. The PAPI_BGL_TS_32B event counts the number of all 32-byte data chunks that have been sent from the node. This includes traffic injected at the node and traffic that cuts through the network controller. The same holds true for the PAPI_BGL_TR_DPKT event that reports tree network traffic.
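Since each counted chunk is 32 bytes, converting a PAPI_BGL_TS_32B reading into an aggregate traffic figure is simple arithmetic. A sketch, assuming the standard 700 MHz Blue Gene/L clock for the elapsed-time conversion, and remembering that the count includes cut-through traffic, so this is link traffic seen at the node, not only data the node itself injected:

```python
# Convert a PAPI_BGL_TS_32B reading into aggregate torus traffic.
CHUNK_BYTES = 32          # each counted torus chunk is 32 bytes
CLOCK_HZ = 700e6          # assumed Blue Gene/L core frequency

def torus_traffic(ts_32b_count, elapsed_cycles):
    """Return (total bytes, bytes per second) for one measurement
    interval, summed over all six torus directions."""
    seconds = elapsed_cycles / CLOCK_HZ
    total_bytes = ts_32b_count * CHUNK_BYTES
    return total_bytes, total_bytes / seconds

# One million chunks over 7e8 cycles (one second at 700 MHz)
total, rate = torus_traffic(ts_32b_count=1_000_000, elapsed_cycles=7e8)
```

The same arithmetic applies to PAPI_BGL_TR_DPKT with a 256-byte packet size for the tree network.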


For the two duration count events defined, PAPI_BGL_TS_FULL and PAPI_BGL_TR_FULL, the count at each UPC cycle is effectively multiplied by the number of channels that experience the condition. That is, if both the X-minus and the Y-plus first in, first outs (FIFOs) experience the condition of no tokens available, both contribute with one count to each UPC clock cycle (every second CPU cycle) until sufficient token acknowledgements are received.
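Because PAPI_BGL_TS_FULL adds one count per blocked link per UPC cycle, and a UPC cycle is every second CPU cycle, the raw counter can be turned into an average "links blocked" figure. This is a sketch based on the description above, not an official derived metric:

```python
# Estimate how starved the torus links were during a measured interval.
def link_full_stats(ts_full_count, elapsed_cpu_cycles, links=6):
    """Return (average links blocked per UPC cycle, fraction of total
    link capacity blocked) for one measurement interval."""
    upc_cycles = elapsed_cpu_cycles / 2   # UPC ticks at half the CPU clock
    avg_links_blocked = ts_full_count / upc_cycles
    fraction_of_capacity = avg_links_blocked / links
    return avg_links_blocked, fraction_of_capacity

avg, frac = link_full_stats(ts_full_count=3_000_000,
                            elapsed_cpu_cycles=12_000_000)
```

In this hypothetical interval, on average half a link was out of tokens at any moment, or about 8 percent of the node's six-link capacity.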

4.1.3 Modifications to PAPI


The standard PAPI distribution, excluding the Blue Gene/L specific substrate, is unchanged from the official release version, except for the following modifications. The following set of new predefined events was added to the existing set of events:
- PAPI_BGL_OED (Oedipus operations in FPU0)
- PAPI_BGL_TS_32B (number of 32 byte packets sent on a torus network)
- PAPI_BGL_TS_FULL (number of UPC cycles torus links with no available tokens)
- PAPI_BGL_TR_DPKT (number of packets sent on the tree network)
- PAPI_BGL_TR_FULL (UPC cycles number of full tree receivers)

The semantics of PAPI_library_init() is changed from the standard distribution. In Blue Gene/L, PAPI_library_init() is a synchronizing call that should be executed by all processes on the partition. It uses the global barrier with a pre-set timeout to initiate the periodic timers that prevent counter overflows. This assures that these interrupts are localized in time over the set of allocated nodes. In virtual node mode, this means that PAPI_library_init should be called by all processes, including the processes that are running on CPU1 on each node. When PAPI_library_init() is called on a partition where not all nodes are participating in the call, a global barrier timeout occurs, and no global synchronization is achieved.

4.2 Examples of using hardware performance monitor libraries for Blue Gene/L
This section provides examples of using the hardware performance monitor on Blue Gene/L.

4.2.1 PAPI library usage examples


Example 4-3 shows an example program using the PAPI library API. This example illustrates the configuration of five counters into an event set, as well as the start, stop, read, and reset of this event set. Measurements are taken over the fpmaddv subroutine, which is a naive implementation of a floating-point multiply-add (FMA) instruction-like operation on three input vectors and one output vector using the Blue Gene/L specific floating-point parallel multiply-add (FPMA) instruction operation. In the experiment, five counters are set up. The counters used are the time base register and the four floating point unit registers. The order of the events when printed is:
1. PAPI_TOT_CYC
2. BGL_FPU_ARITH_OEDIPUS_OP
3. BGL_2NDFPU_ARITH_OEDIPUS_OP
4. BGL_FPU_LDST_QUAD_LD
5. BGL_2NDFPU_LDST_QUAD_LD

The counters are started, some load operations are performed, and then the vectorized FMA routine is called. After this, the counters are read, but left running. Before repeating the call to the FMA routine, the running counters are reset to zero, without stopping or restarting them. The FMA routine is called, and the counters are stopped.

To illustrate the effect of using both FPUs, the code is run both in coprocessor mode and virtual node mode (see Example 4-3). As expected, the registered number of counts is zero in the second FPU when run in coprocessor mode. In virtual node mode, counts are registered in both units, since both units are active. This illustrates the property that the hardware counters are a shared resource between the two processes on the node in virtual node mode. Example 4-3 also illustrates that the library interface itself resolves multiple access to the hardware as well as the virtualized counters. Although both processes create an event set and add counters to it, the library recognizes that the same hardware counter can be reused. Similarly, when a process releases a counter, the underlying hardware counter might remain allocated, if it is used by the other processor.
Example 4-3 PAPI example code

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"
#include "bgl_perfctr_events.h"
#include "bglpersonality.h"
#include "rts.h"

#define N 8
#define NCOUNTS 5

int main(int argc, char* argv[])
{
  double v1[N], v2[N], v3[N], r1[N], r2[N];
  double a=1.01, b=1.02, c=1.03, t=0.0, t2=0.0;
  int i, rank;
  int perr, ev_set;
  int encoding;
  long_long counts[NCOUNTS];

  if(PAPI_VER_CURRENT != (perr=PAPI_library_init(PAPI_VER_CURRENT)))
    printf("\nPAPI_library_init failed. %s\n", PAPI_strerror(perr));

  {
    BGLPersonality me;
    rts_get_personality(&me, sizeof(me));
    if(me.xCoord != 0) goto fine;
    if(me.yCoord != 0) goto fine;
    if(me.zCoord != 0) goto fine;
  }

  for(i=0; i<N; i++) {
    v1[i]=1.01+0.01*i;
    v2[i]=2.01+0.01*i;
    v3[i]=3.01+0.01*i;
    r1[i]=v1[i]*v2[i]+v3[i];
  }

  if((perr=PAPI_create_eventset(&ev_set)))
    printf("\nPAPI_create_eventset failed. %s\n", PAPI_strerror(perr));

  /*
  encoding=( BGL_FPU_ARITH_MULT_DIV & 0x3FF );
  encoding=( BGL_FPU_ARITH_ADD_SUBTRACT & 0x3FF );
  encoding=( BGL_FPU_ARITH_TRINARY_OP & 0x3FF );
  */

  if((perr=PAPI_add_event(&ev_set, PAPI_TOT_CYC)))
    printf("PAPI_add_event failed. %s\n", PAPI_strerror(perr));

  encoding=( BGL_FPU_ARITH_OEDIPUS_OP & 0x3FF );
  if((perr=PAPI_add_event(&ev_set, encoding)))
    printf("\nPAPI_add_event failed. %s\n", PAPI_strerror(perr));

  encoding=( BGL_2NDFPU_ARITH_OEDIPUS_OP & 0x3FF );
  if((perr=PAPI_add_event(&ev_set, encoding)))
    printf("\nPAPI_add_event failed. %s\n", PAPI_strerror(perr));

  encoding=( BGL_FPU_LDST_QUAD_LD & 0x3FF );
  if((perr=PAPI_add_event(&ev_set, encoding)))
    printf("\nPAPI_add_event failed. %s\n", PAPI_strerror(perr));

  encoding=( BGL_2NDFPU_LDST_QUAD_LD & 0x3FF );
  if((perr=PAPI_add_event(&ev_set, encoding)))
    printf("\nPAPI_add_event failed. %s\n", PAPI_strerror(perr));

  printf("\nAssigning a vector of length %1d and computing "
         "A()=B()*C()+D().\n", N);

  if((perr=PAPI_start(ev_set)))
    printf("\nPAPI_start_event failed. %s\n", PAPI_strerror(perr));

  for(i=0; i<N; i++) r2[i]=-1.001;
  fpmaddv(N, v1, v2, v3, r2);

  if((perr=PAPI_read(ev_set, counts)))
    printf("PAPI_read failed. %s\n", PAPI_strerror(perr));

  printf("Counts registered: ");
  for(i=0; i<NCOUNTS; i++) printf(" %12llu", counts[i]);
  printf("\n");

  for(i=0; i<N; i++) {
    printf(" %g * %g + %g = %g (%g)\n",
           v1[i], v2[i], v3[i], r2[i], r1[i]);
  }

  for(i=0; i<N; i++) r2[i]=-1.001;
  printf("\nResetting the running counter and computing "
         "A(1:%1d)=B()*C()+D().\n", N);

  if((perr=PAPI_reset(ev_set)))
    printf("\nPAPI_reset failed. %s\n", PAPI_strerror(perr));

  fpmaddv(N, v1, v2, v3, r2);

  if((perr=PAPI_stop(ev_set, counts)))
    printf("PAPI_stop failed. %s\n", PAPI_strerror(perr));

  for(i=0; i<N; i++) {
    printf(" %g * %g + %g = %g (%g)\n",
           v1[i], v2[i], v3[i], r2[i], v1[i]*v2[i]+v3[i]);
  }

  printf("Testing to read stopped counters\n");
  if((perr=PAPI_read(ev_set, counts)))
    printf("PAPI_read failed. %s\n", PAPI_strerror(perr));

  printf("Counts registered: ");
  for(i=0; i<NCOUNTS; i++) printf(" %12llu", counts[i]);
  printf("\n");

fine:
  PAPI_shutdown();
  return 0;
}

When looking at the output generated by the program executed in coprocessor mode (Example 4-4), there are no surprises. When run in virtual node mode (Example 4-5), the output has been compressed somewhat to make the results fit onto one page. In the virtual node mode case, the two processes (and the two cores) are running with no synchronization between the cores after the initial synchronization at PAPI library initialization. At each core, vectors of length eight are processed. This is the reason for detecting four double operations on the local FPU. The experiment illustrates that the counter reads are naturally synchronized only with the local program activity in the local core, unless specifically programmed to do so. In the illustrated output, process 0 and process 32, which ran on the same node with process 0 on core 0, apparently did not execute the first section of the test example simultaneously. This can be seen from the fact that no counts were generated in the non-local FPU during the execution of the local floating-point activity. The reason behind this is the serialization introduced by printouts to stdout from the processes. In the second part of the experiment, core 1 did its local counter reset and reads so that it saw the events generated in FPU0. Example 4-4 shows sample output from the application shown in Example 4-3, running in coprocessor mode.
Example 4-4 Running the PAPI example code in coprocessor mode
program is loading...ok
program is running
stdout[0]:
stdout[0]: Assigning a vector of length 8 and computing A()=B()*C()+D().
stdout[0]: Counts registered:  9572  4  0  140  0
stdout[0]:  1.01 * 2.01 + 3.01 = 5.0401 (5.0401)
stdout[0]:  1.02 * 2.02 + 3.02 = 5.0804 (5.0804)
stdout[0]:  1.03 * 2.03 + 3.03 = 5.1209 (5.1209)
stdout[0]:  1.04 * 2.04 + 3.04 = 5.1616 (5.1616)
stdout[0]:  1.05 * 2.05 + 3.05 = 5.2025 (5.2025)
stdout[0]:  1.06 * 2.06 + 3.06 = 5.2436 (5.2436)
stdout[0]:  1.07 * 2.07 + 3.07 = 5.2849 (5.2849)
stdout[0]:  1.08 * 2.08 + 3.08 = 5.3264 (5.3264)

stdout[0]:
stdout[0]: Resetting the running counter and computing A(1:8)=B()*C()+D().
stdout[0]:  1.01 * 2.01 + 3.01 = 5.0401 (5.0401)
stdout[0]:  1.02 * 2.02 + 3.02 = 5.0804 (5.0804)
stdout[0]:  1.03 * 2.03 + 3.03 = 5.1209 (5.1209)
stdout[0]:  1.04 * 2.04 + 3.04 = 5.1616 (5.1616)
stdout[0]:  1.05 * 2.05 + 3.05 = 5.2025 (5.2025)
stdout[0]:  1.06 * 2.06 + 3.06 = 5.2436 (5.2436)
stdout[0]:  1.07 * 2.07 + 3.07 = 5.2849 (5.2849)
stdout[0]:  1.08 * 2.08 + 3.08 = 5.3264 (5.3264)
stdout[0]: Testing to read stopped counters
Chapter 4. Performance Application Programming Interface

61

stdout[0]: Counts registered:  8486  0  140
Checking status
program terminated successfully

Example 4-5 shows output from the same application, but this time it is running in virtual node mode.
Example 4-5 Running the PAPI example in virtual node mode
program is running
stdout[32]:
stdout[0]:
stdout[32]: Assigning a vector of length 8 and computing A()=B()*C()+D().
stdout[0]: Assigning a vector of length 8 and computing A()=B()*C()+D().
stdout[32]: Counts registered:  9776  0  4  0  140
stdout[0]: Counts registered:  9664  4  0  140  0
stdout[32]:  1.01 * 2.01 + 3.01 = 5.0401 (5.0401)
stdout[0]:  1.01 * 2.01 + 3.01 = 5.0401 (5.0401)
stdout[32]:  1.08 * 2.08 + 3.08 = 5.3264 (5.3264)
stdout[0]:  1.08 * 2.08 + 3.08 = 5.3264 (5.3264)
stdout[32]:
stdout[0]:
stdout[32]: Resetting the running counter and computing A(1:8)=B()*C()+D().
stdout[0]: Resetting the running counter and computing A(1:8)=B()*C()+D().
stdout[32]:  1.01 * 2.01 + 3.01 = 5.0401 (5.0401)
stdout[0]:  1.01 * 2.01 + 3.01 = 5.0401 (5.0401)
stdout[32]:  1.08 * 2.08 + 3.08 = 5.3264 (5.3264)
stdout[0]:  1.08 * 2.08 + 3.08 = 5.3264 (5.3264)
stdout[32]: Testing to read stopped counters
stdout[0]: Testing to read stopped counters
stdout[32]: Counts registered:  8474  4  4  188  140
stdout[0]: Counts registered:  9638  4  0  140  128
Checking status
program terminated successfully

A second test example is included here as well. This example uses similar code (Example 4-6) but generates a much larger number of counts, illustrating the transparent 32-bit overflow protection in the performance counter API. In contrast to the previous example, the computation routine here uses a standard FMA instruction rather than the Blue Gene/L-specific FPMA instruction.
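The computation kernels fpmaddv() and fmaddv() that the examples call are not listed in this book. A minimal fmaddv() consistent with how the examples invoke it might look like the following sketch; the routine name, argument order, and signature are assumptions inferred from the call site fmaddv(N,v1,v2,v3,r2), not the book's actual source.

```c
/* Hypothetical sketch of the fmaddv() kernel called by Example 4-6.
 * It computes r[i] = b[i]*c[i] + d[i] for n elements, a loop the
 * compiler can map to one fused multiply-add (FMA) per element.
 * Names and signature are assumptions, not the book's listing. */
void fmaddv(int n, const double *b, const double *c,
            const double *d, double *r)
{
    int i;
    for (i = 0; i < n; i++)
        r[i] = b[i] * c[i] + d[i];  /* one FMA per element */
}
```

A corresponding fpmaddv() would perform the same computation but be coded so that the compiler emits the double-pipe FPMA instruction, processing two elements per operation.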


Example 4-6 PAPI example code exercising 32-bit overflow protection
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"
#include "bgl_perfctr_events.h"

// Use the FPMA version of the computation
#undef FPMA

#define N 4000000
#define NITER 1100
#define NCOUNTS 5

int main(int argc, char* argv[])
{
  double v1[N], v2[N], v3[N], r1[N], r2[N];
  double a=1.01,b=1.02,c=1.03,t=0.0,t2=0.0;
  int i, rank, iter;
  int perr, ev_set;
  int encoding;
  long_long counts[NCOUNTS];

#include "bglpersonality.h"
#include "rts.h"

  if(PAPI_VER_CURRENT!=(perr=PAPI_library_init(PAPI_VER_CURRENT)))
    printf("PAPI_library_init failed. %s\n",PAPI_strerror(perr));

  {
    BGLPersonality me;
    rts_get_personality(&me,sizeof(me));
    if(me.xCoord != 0 ) goto fine;
    if(me.yCoord != 0 ) goto fine;
    if(me.zCoord != 0 ) goto fine;
  }

  for(i=0;i<N;i++) {
    v1[i]=1.01+0.01*i;
    v2[i]=2.01+0.01*i;
    v3[i]=3.01+0.01*i;
    r1[i]=v1[i]*v2[i]+v3[i];
  }
  for(i=0;i<N;i++) r2[i]=-1.001;

  if((perr=PAPI_create_eventset(&ev_set)))
    printf("PAPI_create_eventset failed. %s\n",PAPI_strerror(perr));
  if((perr=PAPI_add_event(&ev_set,PAPI_TOT_CYC)))
    printf("PAPI_add_event failed. %s\n",PAPI_strerror(perr));
  encoding=( BGL_FPU_ARITH_TRINARY_OP & 0x3FF );
  if((perr=PAPI_add_event(&ev_set,encoding)))
    printf("\nPAPI_add_event failed. %s\n",PAPI_strerror(perr));
  encoding=( BGL_2NDFPU_ARITH_TRINARY_OP & 0x3FF );
  if((perr=PAPI_add_event(&ev_set,encoding)))
    printf("\nPAPI_add_event failed. %s\n",PAPI_strerror(perr));
  encoding=( BGL_FPU_LDST_DBL_LD & 0x3FF );
  if((perr=PAPI_add_event(&ev_set,encoding)))
    printf("\nPAPI_add_event failed. %s\n",PAPI_strerror(perr));
  encoding=( BGL_2NDFPU_LDST_DBL_LD & 0x3FF );
  if((perr=PAPI_add_event(&ev_set,encoding)))
    printf("\nPAPI_add_event failed. %s\n",PAPI_strerror(perr));

  if((perr=PAPI_start(ev_set)))
    printf("PAPI_start_event failed. %s\n",PAPI_strerror(perr));

  printf("\n\nPerforming %d iterations of vector operations for\n"
         "a total of %lld (0x%llx) number of FMAs\n",
         NITER,((long long)NITER)*N,((long long)NITER)*N);
  for(iter=0;iter<NITER;iter++) {
    if(iter%100==0)
      printf("\t---- Iteration %4.4d of %4.4d ----\n",iter,NITER);
#ifdef FPMA
    fpmaddv(N,v1,v2,v3,r2);
#else
    fmaddv(N,v1,v2,v3,r2);
#endif
  }

  if((perr=PAPI_stop(ev_set,counts)))
    printf("PAPI_stop failed. %s\n",PAPI_strerror(perr));

  printf("Counts registered: ");
  for(i=0;i<NCOUNTS;i++) printf(" %12llu",counts[i]);
  printf("\n");

 fine:
  PAPI_shutdown();
  return 0;
}

The counters used in this experiment are the time base register and the four FPU counter registers. The order of the events when printed is:
1. PAPI_TOT_CYC
2. BGL_FPU_ARITH_TRINARY_OP
3. BGL_2NDFPU_ARITH_TRINARY_OP
4. BGL_FPU_LDST_DBL_LD
5. BGL_2NDFPU_LDST_DBL_LD

The experiment is set up to perform 4.4 x 10^9 trinary operations, which exceeds 2^32, as shown in the generated output in coprocessor mode (Example 4-7) as well as in virtual node mode (Example 4-8). The output illustrates that the library correctly protects against 32-bit wrap-around errors.
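The arithmetic behind that claim can be checked directly. The following standalone helper functions (not part of the book's code) reproduce the totals that appear in the example output:

```c
#include <stdint.h>

/* Check the overflow claim from Example 4-6: NITER iterations of the
 * N-element FMA loop perform NITER*N fused multiply-adds in total,
 * which is more than an unsigned 32-bit counter (max 4294967295)
 * can hold. These helpers are illustrative, not from the book. */
static inline long long total_fmas(long long niter, long long n)
{
    return niter * n;            /* 1100 * 4000000 = 4400000000 */
}

static inline long long wraps_of_32bit_counter(long long events)
{
    return events >> 32;         /* times a 32-bit counter rolls over */
}
```

With the Example 4-6 parameters, total_fmas(1100, 4000000) is 4400000000 (0x10642ac00), matching the totals printed in Example 4-7 and Example 4-8, and a 32-bit hardware counter would wrap once during the run.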


Example 4-7 Running the PAPI overflowing example code in coprocessor mode
program is running
stdout[0]:
stdout[0]:
stdout[0]: Performing 1100 iterations of vector operations for
stdout[0]: a total of 4400000000 (0x10642ac00) number of FMAs
stdout[0]: Time base: 915546797451
stdout[0]: ---- Iteration 0000 of 1100 ----
stdout[0]: ---- Iteration 0100 of 1100 ----
stdout[0]: ---- Iteration 0200 of 1100 ----
stdout[0]: ---- Iteration 0300 of 1100 ----
stdout[0]: ---- Iteration 0400 of 1100 ----
stdout[0]: ---- Iteration 0500 of 1100 ----
stdout[0]: ---- Iteration 0600 of 1100 ----
stdout[0]: ---- Iteration 0700 of 1100 ----
stdout[0]: ---- Iteration 0800 of 1100 ----
stdout[0]: ---- Iteration 0900 of 1100 ----
stdout[0]: ---- Iteration 1000 of 1100 ----
stdout[0]: Counts registered:  85820449687  4400000000  0  0  13200000232
Checking status
program terminated successfully

Example 4-8 Running the PAPI overflowing example code in virtual node mode
program is running
stdout[0]:
stdout[32]:
stdout[0]:
stdout[32]:
stdout[0]: Performing 1100 iterations of vector operations for
stdout[32]: Performing 1100 iterations of vector operations for
stdout[0]: a total of 4400000000 (0x10642ac00) number of FMAs
stdout[32]: a total of 4400000000 (0x10642ac00) number of FMAs
stdout[0]: ---- Iteration 0000 of 1100 ----
stdout[32]: ---- Iteration 0000 of 1100 ----
stdout[0]: ---- Iteration 0100 of 1100 ----
stdout[32]: ---- Iteration 0100 of 1100 ----
stdout[32]: ---- Iteration 1000 of 1100 ----
stdout[0]: ---- Iteration 1000 of 1100 ----
stdout[32]: Counts registered:  109898564159  4400000000  4400000000  13200000137  13200000174
stdout[0]: Counts registered:  109898570635  4400000000  4400000000  13200000246  13200000235
Checking status
program terminated successfully


4.2.2 bgl_perfctr usage example


Example 4-9 illustrates the usage of the lower-level substrate. To make the internal behavior of the library visible across the different function calls, the example makes heavy use of the bgl_perfctr_dump_state() function. This function is not used in normal operation, but it is helpful for illustrating the changes in the internal state of the control structure. The code in Example 4-9 performs the following operations, dumping the counter state after each one:
1. Initializing the library
2. Scheduling an event for addition
3. Scheduling a second event for addition
4. Committing pending configuration changes
5. Scheduling an event for removal
6. Revoking pending changes
7. Updating a virtual counter to establish a counter baseline
8. Updating a second virtual counter to see the number of counts that were aggregated
9. Updating a third virtual counter to increment the virtual counters with events since the last update

Example 4-9 The bgl_perfctr example code
#include <stdio.h>
#include <stdlib.h>
#include "bgl_perfctr.h"
#include "bgl_perfctr_events.h"

#define EV1 BGL_UPC_L3_CACHE_HIT
#define EV2 BGL_UPC_L3_CACHE_MISS_DATA_WILL_BE_REQED_DDR
//#define EV1 BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR
//#define EV2 BGL_UPC_L3_EDRAM_ACCESS_CYCLE

int main()
{
  bgl_perfctr_control_t *hwctrs;
  BGL_PERFCTR_event_t ev;
  int i,n,err,rank;
  int *memarea;

#include "bglpersonality.h"
#include "rts.h"

  {
    BGLPersonality me;
    rts_get_personality(&me,sizeof(me));
    if(me.xCoord != 0 ) goto fine;
    if(me.yCoord != 0 ) goto fine;
    if(me.zCoord != 0 ) goto fine;
  }

  if(bgl_perfctr_init()) abort();
  bgl_perfctr_dump_state(stdout);

  ev.edge=0x1;
  ev.num=EV1;
  err=bgl_perfctr_add_event(ev);
  if(err) { printf("Add event line %d failed.\n",__LINE__-2); exit(1); }
  else printf("One event added. %s\n",BGL_PERFCTR_event_table[EV1].event_name);


  bgl_perfctr_dump_state(stdout);

  ev.num=EV2;
  err=bgl_perfctr_add_event(ev);
  if(err) { printf("Add event line %d failed.\n",__LINE__-2); exit(1); }
  else printf("One more event added. %s\n",BGL_PERFCTR_event_table[EV2].event_name);
  bgl_perfctr_dump_state(stdout);

  err=bgl_perfctr_commit();
  if(err) { printf("Commit %d failed.\n",__LINE__-2); exit(1); }
  else printf("Commit successful.\n");
  bgl_perfctr_dump_state(stdout);

  ev.num=EV1;
  err=bgl_perfctr_remove_event(ev);
  if(err) { printf("Remove %d failed.\n",__LINE__-2); exit(1); }
  else printf("Remove successful.\n");
  bgl_perfctr_dump_state(stdout);

  err=bgl_perfctr_revoke();
  if(err) { printf("Revoke %d failed.\n",__LINE__-2); exit(1); }
  else printf("Revoke successful.\n");
  bgl_perfctr_dump_state(stdout);

  printf("\n\n----------------------\n\n");
  printf("\n bgl_perfctr_update \n");
  bgl_perfctr_update();
  bgl_perfctr_dump_state(stdout);

  n=1024*1024;
  memarea=(int *) malloc(1024*1024*sizeof(int));
  for(i=0;i<n;i++) memarea[i]=n-1;

  printf("\n bgl_perfctr_update again after loop\n");
  bgl_perfctr_update();
  bgl_perfctr_dump_state(stdout);

  for(i=0;i<n;i++) memarea[i]-=1;

  printf("\n bgl_perfctr_update again after loop\n");
  bgl_perfctr_update();
  bgl_perfctr_dump_state(stdout);

  if(bgl_perfctr_shutdown()) abort();


 fine:
  return 0;
}

Example 4-10 shows the output from running the program in Example 4-9.
Example 4-10 Running the bgl_perfctr example code program is running stdout[0]: stdout[0]: stdout[0]: stdout[0]: stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: c-mode=0 stdout[0]: stdout[0]: stdout[0]: stdout[0]: stdout[0]: stdout[0]: stdout[0]: stdout[0]: stdout[0]: stdout[0]: stdout[0]: stdout[0]: -------- bgl_perfctr_dump_state ------0 defined events. in_use=0x00000000 modified=0x00000000 Id code - Interpretation UPC events A: edge code IRQ | B: edge code IRQ | C: edge code IRQ 0: 0x00000000 0 0 - | 0 0 - | 0 0 1: 0x00000000 2: 0x00000000 3: 0x00000000 4: 0x00000000 5: 0x00000000 6: 0x00000000 7: 0x00000000 8: 0x00000000 9: 0x00000000 10: 0x00000000 11: 0x00000000 12: 0x00000000 13: 0x00000000 14: 0x00000000 15: 0x00000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | | | | | | | | | | | | | | | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | | | | | | | | | | | | | | | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -

FPU Hummer ARITH: Act Code | LD/ST: Act Code 16: 0x00000000 0 0 | 0 0 FPU Hummer CPU2 ARITH: Act Code | LD/ST: Act Code 17: 0x00000000 0 0 | 0 0 Id Event H/W CtrlReg RefCount NewCount Current cached values in the active counters Last Virtual One event added. BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR -------- bgl_perfctr_dump_state ------1 defined events. in_use=0x00000000 modified=0x00000010 Id code - Interpretation UPC events A: edge code IRQ | B: edge code IRQ | C: edge code IRQ


stdout[0]: 0: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 1: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 2: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 3: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 4: 0x00081000 M 0 0 - | 1 1 - | 0 0 c-mode=0 stdout[0]: 5: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 6: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 7: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 8: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 9: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 10: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 11: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 12: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 13: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 14: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 15: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: FPU Hummer ARITH: Act Code | LD/ST: Act Code stdout[0]: 16: 0x00000000 0 0 | 0 0 stdout[0]: FPU Hummer CPU2 ARITH: Act Code | LD/ST: Act Code stdout[0]: 17: 0x00000000 0 0 | 0 0 stdout[0]: Id Event H/W CtrlReg RefCount NewCount stdout[0]: 0: 17 13 4 0 1 (BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR) stdout[0]: Current cached values in the active counters stdout[0]: Last Virtual stdout[0]: One more event added. BGL_UPC_PU0_DCURD_WAIT_L3 stdout[0]: -------- bgl_perfctr_dump_state ------stdout[0]: 2 defined events. in_use=0x00000000 modified=0x00000050 stdout[0]: Id code - Interpretation stdout[0]: UPC events A: edge code IRQ | B: edge code IRQ | C: edge code IRQ stdout[0]: 0: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 1: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 2: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 3: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 4: 0x00081000 M 0 0 - | 1 1 - | 0 0 c-mode=0 stdout[0]: 5: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 6: 0x00005000 M 0 0 - | 0 5 - | 0 0 c-mode=0


stdout[0]: 7: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 8: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 9: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 10: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 11: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 12: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 13: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 14: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 15: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: FPU Hummer ARITH: Act Code | LD/ST: Act Code stdout[0]: 16: 0x00000000 0 0 | 0 0 stdout[0]: FPU Hummer CPU2 ARITH: Act Code | LD/ST: Act Code stdout[0]: 17: 0x00000000 0 0 | 0 0 stdout[0]: Id Event H/W CtrlReg RefCount NewCount stdout[0]: 0: 17 13 4 0 1 (BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR) stdout[0]: 1: 66 19 6 0 1 (BGL_UPC_PU0_DCURD_WAIT_L3) stdout[0]: Current cached values in the active counters stdout[0]: Last Virtual stdout[0]: Commit successful. stdout[0]: -------- bgl_perfctr_dump_state ------stdout[0]: 2 defined events. in_use=0x00082000 modified=0x00000000 stdout[0]: Id code - Interpretation stdout[0]: UPC events A: edge code IRQ | B: edge code IRQ | C: edge code IRQ stdout[0]: 0: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 1: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 2: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 3: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 4: 0x00081000 0 0 - | 1 1 - | 0 0 c-mode=0 stdout[0]: 5: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 6: 0x00005000 0 0 - | 0 5 - | 0 0 c-mode=0 stdout[0]: 7: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 8: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 9: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 10: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 11: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 12: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0


stdout[0]: 13: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 14: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 15: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: FPU Hummer ARITH: Act Code | LD/ST: Act Code stdout[0]: 16: 0x00000000 0 0 | 0 0 stdout[0]: FPU Hummer CPU2 ARITH: Act Code | LD/ST: Act Code stdout[0]: 17: 0x00000000 0 0 | 0 0 stdout[0]: Id Event H/W CtrlReg RefCount NewCount stdout[0]: 0: 17 13 4 1 1 (BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR) stdout[0]: 1: 66 19 6 1 1 (BGL_UPC_PU0_DCURD_WAIT_L3) stdout[0]: Current cached values in the active counters stdout[0]: Last Virtual stdout[0]: 13: 0 0 stdout[0]: 19: 0 0 stdout[0]: Remove successful. stdout[0]: -------- bgl_perfctr_dump_state ------stdout[0]: 2 defined events. in_use=0x00082000 modified=0x00000000 stdout[0]: Id code - Interpretation stdout[0]: UPC events A: edge code IRQ | B: edge code IRQ | C: edge code IRQ stdout[0]: 0: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 1: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 2: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 3: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 4: 0x00081000 0 0 - | 1 1 - | 0 0 c-mode=0 stdout[0]: 5: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 6: 0x00005000 0 0 - | 0 5 - | 0 0 c-mode=0 stdout[0]: 7: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 8: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 9: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 10: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 11: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 12: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 13: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 14: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 15: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: FPU Hummer ARITH: Act Code | LD/ST: Act Code stdout[0]: 16: 0x00000000 0 0 | 0 0 stdout[0]: FPU Hummer CPU2 ARITH: Act Code | LD/ST: Act Code stdout[0]: 17: 0x00000000 0 0 | 0 0 stdout[0]: Id Event H/W CtrlReg 
RefCount NewCount


stdout[0]: 0: 17 13 4 1 0 (BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR) stdout[0]: 1: 66 19 6 1 1 (BGL_UPC_PU0_DCURD_WAIT_L3) stdout[0]: Current cached values in the active counters stdout[0]: Last Virtual stdout[0]: 13: 0 0 stdout[0]: 19: 0 0 stdout[0]: Revoke successful. stdout[0]: -------- bgl_perfctr_dump_state ------stdout[0]: 2 defined events. in_use=0x00082000 modified=0x00000000 stdout[0]: Id code - Interpretation stdout[0]: UPC events A: edge code IRQ | B: edge code IRQ | C: edge code IRQ stdout[0]: 0: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 1: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 2: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 3: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 4: 0x00081000 0 0 - | 1 1 - | 0 0 c-mode=0 stdout[0]: 5: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 6: 0x00005000 0 0 - | 0 5 - | 0 0 c-mode=0 stdout[0]: 7: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 8: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 9: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 10: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 11: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 12: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 13: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 14: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 15: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: FPU Hummer ARITH: Act Code | LD/ST: Act Code stdout[0]: 16: 0x00000000 0 0 | 0 0 stdout[0]: FPU Hummer CPU2 ARITH: Act Code | LD/ST: Act Code stdout[0]: 17: 0x00000000 0 0 | 0 0 stdout[0]: Id Event H/W CtrlReg RefCount NewCount stdout[0]: 0: 17 13 4 1 1 (BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR) stdout[0]: 1: 66 19 6 1 1 (BGL_UPC_PU0_DCURD_WAIT_L3) stdout[0]: Current cached values in the active counters stdout[0]: Last Virtual stdout[0]: 13: 0 0 stdout[0]: 19: 0 0 stdout[0]: stdout[0]: bgl_perfctr_update stdout[0]: -------- bgl_perfctr_dump_state ------stdout[0]: 2 defined events. 
in_use=0x00082000 modified=0x00000000


stdout[0]: Id code - Interpretation stdout[0]: UPC events A: edge code IRQ | B: edge code IRQ | C: edge code IRQ stdout[0]: 0: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 1: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 2: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 3: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 4: 0x00081000 0 0 - | 1 1 - | 0 0 c-mode=0 stdout[0]: 5: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 6: 0x00005000 0 0 - | 0 5 - | 0 0 c-mode=0 stdout[0]: 7: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 8: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 9: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 10: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 11: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 12: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 13: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 14: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 15: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: FPU Hummer ARITH: Act Code | LD/ST: Act Code stdout[0]: 16: 0x00000000 0 0 | 0 0 stdout[0]: FPU Hummer CPU2 ARITH: Act Code | LD/ST: Act Code stdout[0]: 17: 0x00000000 0 0 | 0 0 stdout[0]: Id Event H/W CtrlReg RefCount NewCount stdout[0]: 0: 17 13 4 1 1 (BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR) stdout[0]: 1: 66 19 6 1 1 (BGL_UPC_PU0_DCURD_WAIT_L3) stdout[0]: Current cached values in the active counters stdout[0]: Last Virtual stdout[0]: 13: 8318 8318 stdout[0]: 19: 231293 231293 stdout[0]: stdout[0]: bgl_perfctr_update again after loop stdout[0]: -------- bgl_perfctr_dump_state ------stdout[0]: 2 defined events. in_use=0x00082000 modified=0x00000000 stdout[0]: Id code - Interpretation stdout[0]: UPC events A: edge code IRQ | B: edge code IRQ | C: edge code IRQ stdout[0]: 0: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 1: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 2: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 3: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0


stdout[0]: 4: 0x00081000 0 0 - | 1 1 - | 0 0 c-mode=0 stdout[0]: 5: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 6: 0x00005000 0 0 - | 0 5 - | 0 0 c-mode=0 stdout[0]: 7: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 8: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 9: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 10: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 11: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 12: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 13: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 14: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 15: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: FPU Hummer ARITH: Act Code | LD/ST: Act Code stdout[0]: 16: 0x00000000 0 0 | 0 0 stdout[0]: FPU Hummer CPU2 ARITH: Act Code | LD/ST: Act Code stdout[0]: 17: 0x00000000 0 0 | 0 0 stdout[0]: Id Event H/W CtrlReg RefCount NewCount stdout[0]: 0: 17 13 4 1 1 (BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR) stdout[0]: 1: 66 19 6 1 1 (BGL_UPC_PU0_DCURD_WAIT_L3) stdout[0]: Current cached values in the active counters stdout[0]: Last Virtual stdout[0]: 13: 133235 133235 stdout[0]: 19: 1727334 1727334 stdout[0]: stdout[0]: bgl_perfctr_update again after loop stdout[0]: -------- bgl_perfctr_dump_state ------stdout[0]: 2 defined events. in_use=0x00082000 modified=0x00000000 stdout[0]: Id code - Interpretation stdout[0]: UPC events A: edge code IRQ | B: edge code IRQ | C: edge code IRQ stdout[0]: 0: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 1: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 2: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 3: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 4: 0x00081000 0 0 - | 1 1 - | 0 0 c-mode=0 stdout[0]: 5: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 6: 0x00005000 0 0 - | 0 5 - | 0 0 c-mode=0 stdout[0]: 7: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 8: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0


stdout[0]: 9: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 10: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 11: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 12: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 13: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 14: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: 15: 0x00000000 0 0 - | 0 0 - | 0 0 c-mode=0 stdout[0]: FPU Hummer ARITH: Act Code | LD/ST: Act Code stdout[0]: 16: 0x00000000 0 0 | 0 0 stdout[0]: FPU Hummer CPU2 ARITH: Act Code | LD/ST: Act Code stdout[0]: 17: 0x00000000 0 0 | 0 0 stdout[0]: Id Event H/W CtrlReg RefCount NewCount stdout[0]: 0: 17 13 4 1 1 (BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR) stdout[0]: 1: 66 19 6 1 1 (BGL_UPC_PU0_DCURD_WAIT_L3) stdout[0]: Current cached values in the active counters stdout[0]: Last Virtual stdout[0]: 13: 170127 170127 stdout[0]: 19: 2064954 2064954

Checking status
program terminated successfully

4.3 Conclusion
This chapter detailed the implementation of the user APIs to access and control hardware performance counters on Blue Gene/L. The APIs consist of two libraries: bgl_perfctr and PAPI. bgl_perfctr is a low-level abstraction that unifies the behavior of the different counter sources into a single abstraction and takes care of 64-bit virtualization and automatic overflow protection of virtual event counters. The bgl_perfctr API is intended to reflect the hardware implementation of the performance counters in a user-friendly way, without hiding the details of that hardware implementation. The examples in 4.2, "Examples of using hardware performance monitor libraries for Blue Gene/L" on page 58, illustrate the virtualization to 64-bit counters and the 32-bit overflow protection. PAPI is a higher-level abstraction that aims to make hardware counter access uniform across computer platforms that use different CPU architectures and come from different vendors. This chapter presented specific details about the PAPI implementation on Blue Gene/L, including the newly introduced PAPI preset events for Blue Gene/L and minor changes to library behavior that are pertinent to Blue Gene/L. In 4.2, "Examples of using hardware performance monitor libraries for Blue Gene/L" on page 58, we demonstrated start, stop, read, and reset of 64-bit virtual counters, as well as the ability to correctly register event counts in excess of 2^32.
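The 64-bit virtualization summarized above can be sketched in a few lines. The following is a generic illustration of the technique, not the actual bgl_perfctr source: as long as the virtual counter is updated at least once per 2^32 events, unsigned 32-bit subtraction of successive raw readings yields the correct delta even when the hardware register wraps.

```c
#include <stdint.h>

/* Generic sketch of 64-bit counter virtualization with 32-bit
 * wrap protection. Names and structure are illustrative and are
 * not taken from the bgl_perfctr implementation. */
typedef struct {
    uint32_t last_raw;  /* hardware counter value at previous update */
    uint64_t value;     /* accumulated 64-bit virtual count */
} virtual_counter;

static void counter_update(virtual_counter *vc, uint32_t raw_now)
{
    /* Unsigned subtraction is defined modulo 2^32, so the delta is
     * correct even if raw_now has wrapped past last_raw since the
     * previous update. */
    vc->value += (uint32_t)(raw_now - vc->last_raw);
    vc->last_raw = raw_now;
}
```

The requirement that follows from this design is that updates must occur often enough that no counter advances by 2^32 or more between two updates, which is why the library refreshes the virtual counters on every read, reset, and stop.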



Appendix A.

Statement of completion
IBM considers installation and integration services complete when the following activities have taken place:
- The Service Node powers on and off and reports the system status.
- Rack and system diagnostic runs have completed.
- The ability of the Front End Node to submit the Linpack application to a target 512 Compute Node partition has been demonstrated.
- Linpack has run on a maximum system partition.
- The ability to submit multiple Linpack jobs to multiple partitions simultaneously has been demonstrated.
- The ability to route Ethernet traffic to a TCP/IP destination has been demonstrated.

Copyright IBM Corp. 2006. All rights reserved.

77


Appendix B.

Electromagnetic compatibility
This appendix provides important electromagnetic compatibility information about Blue Gene/L for various countries and regions around the world.
European Union Electromagnetic Compatibility Directive

This product is in conformity with the protection requirements of EU Council Directive 89/336/EEC on the approximation of the laws of the Member States relating to electromagnetic compatibility. IBM cannot accept responsibility for any failure to satisfy the protection requirements resulting from a non-recommended modification of the product, including the fitting of non-IBM option cards.

Attention: This is a Class A product. In a domestic environment, this product may cause radio interference, in which case the user may be required to take adequate measures.

European Union Class A

This product has been tested and found to comply with the limits for Class A Information Technology Equipment according to European Standard EN 55022. The limits for Class A equipment were derived for commercial and industrial environments to provide reasonable protection against interference with licensed communication equipment. Properly shielded and grounded cables and connectors must be used in order to reduce the potential for causing interference to radio and TV communications and to other electrical or electronic equipment. IBM cannot accept responsibility for any interference caused by using other than recommended cables and connectors.

Canada

This Class A digital apparatus complies with Canadian ICES-003. Cet appareil numérique de la classe A est conforme à la norme NMB-003 du Canada.

Japan - VCCI Class A


Korean

United States FCC class A

Federal Communications Commission (FCC) Statement: This equipment has been tested and found to comply with the limits for a Class A digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference when the equipment is operated in a commercial environment. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordance with the instruction manual, may cause harmful interference to radio communications. Operation of this equipment in a residential area is likely to cause harmful interference, in which case the user will be required to correct the interference at his own expense.

Properly shielded and grounded cables and connectors must be used in order to meet FCC emission limits. IBM is not responsible for any radio or television interference caused by using other than recommended cables and connectors or by unauthorized changes or modifications to this equipment. Unauthorized changes or modifications could void the user's authority to operate the equipment.

This device complies with Part 15 of the FCC Rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including interference that may cause undesired operation.


Appendix C.

Perfmon database table specifications


This appendix provides details about the database tables that are used by the Perfmon facility.


Database organization
The External Performance Instrumentation Facility can optionally store results in the Midplane Management Control System (MMCS) performance database, which is made up of seven SQL database tables:
- BGLPERFINST
- BGLPERFDEF
- BGLPERFDESC
- BGLPERFJOB
- BGLPERFLOCATION
- BGLPERFSAMPLES
- BGLPERFDATA

The majority of the data that is collected resides in the BGLPERFDATA table. The other tables join to BGLPERFDATA, and to each other in various ways, to allow the data to be easily accessed and managed. The basic function provided by the schema permits the collected data to be related to individual nodes, individual applications, individual periods of time on the system, and unique instances or uses of the perfmon command. The following sections describe the attributes for these tables.

Performance collection instance table: BGLPERFINST


The purpose of the BGLPERFINST table (Table C-1) is to provide summary information about the data that is collected each time the performance monitor is started. The record details, among other things, when the monitor was started and stopped and the range of JOB_IDs for which performance information is recorded. It provides summary information about the number of records that were recorded, which can be used to manage the volume of data. It can be used to gain access to the data that is collected in each start instance, and its DEFINITION_ID field summarizes the type of data that was collected.
Table C-1   BGLPERFINST table

INSTANCE_ID (Integer): Each time a perfmon command is processed, a new measurement instance is created, and the measurements taken for that instance become associated with a new INSTANCE_ID. This is an auto-increment field.
DEFINITION_ID (Integer): The definition ID used when collecting data for this instance. For version 2, this column is always set to -1.
VERSION (Integer): A version number that identifies the modification level of the facility under which the measurement was taken. Version 1 is any release prior to Blue Gene/L V1R3. Version 2 starts with Blue Gene/L V1R3.
TIME_START (Time stamp): The clock time at which this measurement was started.
JOB_START (Integer): The first job ID for measurements that were collected in this measurement instance.
TIME_STOP (Time stamp): The clock time at which this measurement was ended.
JOB_STOP (Integer): The last job ID for measurements that were collected in this measurement instance.


RECORDS (Big integer): The number of records written to the BGLPERFDATA table between TIME_START and TIME_STOP. This value can be used to calculate the approximate volume of data, in bytes, collected during this measurement instance.
SAMPLE_TYPE (Char(1)): An indicator that specifies whether detailed or summary samples were collected for this instance. The values are D for detailed and S for summary. All version 1 instances are detailed.
SAMPLE_INTERVAL (Big integer): The sample interval, in seconds, used for this instance. This value is null for all version 1 instances.
JOB_PREDICATES (Varchar(1024)): Any predicates that provided selectivity for the jobs to be monitored. This value is null for all version 1 instances.
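Because RECORDS counts the rows written to BGLPERFDATA, a rough per-instance data-volume estimate follows directly. The 28-byte row width used here is an assumption (five 4-byte integer columns plus one 8-byte big integer in BGLPERFDATA); actual DB2 storage overhead will differ.

```python
def approx_instance_bytes(records: int, bytes_per_row: int = 28) -> int:
    """Approximate raw bytes of sample data for one BGLPERFINST row:
    RECORDS rows in BGLPERFDATA times an assumed row width (five 4-byte
    integers plus one 8-byte VALUE = 28 bytes; DB2 storage will differ)."""
    return records * bytes_per_row

print(approx_instance_bytes(1_000_000))  # 28000000
```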

Performance definition table: BGLPERFDEF


Table C-2 describes the performance counter definition IDs.
Table C-2   BGLPERFDEF table

DEFINITION_ID (Integer): An integer value that uniquely identifies the measurement definition with which the counter ID and event edge in this record are associated.
COUNTER_ID (Integer): The COUNTER_ID of the performance measurement associated with the DEFINITION_ID specified in this record.
EVENT_EDGE (Integer): An indicator that specifies when the performance counter associated with this measurement is to be incremented.

Performance description table: BGLPERFDESC


The BGLPERFDESC table (Table C-3) provides both a short and a long text description for each of the specific counters that can be measured on Blue Gene/L. The table has one record per counter; each record lists a COUNTER_ID, a short name, and a long text description. The table is normally joined to on COUNTER_ID to provide meaningful column headings for tables of counter data queried from the database.

Table C-3   BGLPERFDESC table

COUNTER_ID (Integer): The performance COUNTER_ID associated with the measurement name and description in this record.
NAME (Char(128)): The short name of the measurement specified by the COUNTER_ID.
DESCRIPTION (Char(230)): A text description of the measurement specified by the COUNTER_ID.


Performance job table: BGLPERFJOB


The BGLPERFJOB table (Table C-4) contains summary information about each job for which data has been collected. Records in the BGLPERFJOB table can be joined to by INSTANCE_ID and by JOB_ID. Each BGLPERFJOB record contains information about the size of the job in terms of the number of X, Y, and Z nodes that it involved, and the start time and stop time of the job. This record also describes the number of samples that were taken for the job and an indication as to whether the last sample contains valid data.
Table C-4   BGLPERFJOB table

JOB_ID (Integer): MMCS JOB_ID of the job for which measurements were collected.
INSTANCE_ID (Integer): The INSTANCE_ID under which measurements for the specified job were collected.
SIZEX (Integer): The number of nodes in the X torus direction for the job.
SIZEY (Integer): The number of nodes in the Y torus direction for the job.
SIZEZ (Integer): The number of nodes in the Z torus direction for the job.
MODE (Char(1)): The mode under which the job ran: C for coprocessor mode, V for virtual node mode.
TIME_START (Time stamp): The time at which the first measurement interval was collected for the job.
TIME_STOP (Time stamp): The time at which the last measurement interval was collected for the job.
SAMPLES (Integer): The number of samples collected for the job.
VALID (Char(1)): A flag that indicates whether the data collected at the last interval was judged to be valid: T for valid, F for not valid. This attribute is always T for version 2.

Performance location table: BGLPERFLOCATION


The BGLPERFLOCATION table (Table C-5) contains records that identify the nodes that make up each job for which data was collected. Jobs are identified by the MMCS job ID, which can be used to query the MMCS database. The intent of this table is to tie collected data back to a specific location in the Blue Gene/L machine and to the MMCS job history information.
Table C-5   BGLPERFLOCATION table

JOB_ID (Integer): MMCS JOB_ID of the job for which measurements were collected.
NODE_ID (Integer): The MPI rank of a node in the job for which measurement data was collected.
X_COORD (Integer): The X torus coordinate of the node.
Y_COORD (Integer): The Y torus coordinate of the node.
Z_COORD (Integer): The Z torus coordinate of the node.
LOCATION (Char(32)): The node location that identifies the rack, midplane, node board, and node placement.

Performance samples table: BGLPERFSAMPLES


Table C-6 describes the performance samples table.

Table C-6   BGLPERFSAMPLES table

JOB_ID (Integer): MMCS JOB_ID of the job for which measurements were collected.
SAMPLE_NUM (Integer): The sample number for the measurement in the job.
TIME (Time stamp): The time at which the data for the sample was collected for the job.
VALID (Char(1)): A flag that indicates whether the collected data is valid. Values are T and F, representing True and False respectively. For version 1, the value indicates whether the sample is valid. For version 2, the value is always T.
INSTANCE_ID (Integer): The INSTANCE_ID under which measurements for the specified job were collected. For version 1, this value is null.
DEFINITION_ID (Integer): The counter definition ID used to collect this sample. For version 1, this value is null.

Performance data file table: BGLPERFDATA


Table C-7 describes the performance data table.

Table C-7   BGLPERFDATA table

NODE_ID (Integer): The NODE_ID of the measurement in the job.
JOB_ID (Integer): MMCS JOB_ID of the job for which measurements were collected.
COUNTER_ID (Integer): The COUNTER_ID of the measurement.
SAMPLE_NUM (Integer): The SAMPLE_NUM for the measurement in the job.
VALUE (Big integer): The counter value for the measurement.
INSTANCE_ID (Integer): The INSTANCE_ID under which measurements for the specified job were collected. For version 1, this value is null.
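As a sketch of how per-job counter values might be aggregated from this table, the following uses sqlite3 in place of the real database, with invented rows. Whether summing across SAMPLE_NUM values is meaningful depends on whether a given counter definition records cumulative or per-interval values, so treat the query shape, not the arithmetic, as the point.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE BGLPERFDATA "
            "(NODE_ID, JOB_ID, COUNTER_ID, SAMPLE_NUM, VALUE, INSTANCE_ID)")
# Invented rows: counter 16 on two nodes of job 7, across two samples.
con.executemany("INSERT INTO BGLPERFDATA VALUES (?,?,?,?,?,?)",
                [(0, 7, 16, 1, 100, 1),
                 (1, 7, 16, 1, 250, 1),
                 (0, 7, 16, 2, 300, 1)])

# Total one counter over all nodes and samples for a single job.
total = con.execute(
    "SELECT SUM(VALUE) FROM BGLPERFDATA WHERE JOB_ID = 7 AND COUNTER_ID = 16"
).fetchone()[0]
print(total)  # 650
```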


BGLPERFDESC table
The actual row values for the BGLPERFDESC table are included in Table C-8 to show the possible counters that can be monitored. To link the performance monitoring capabilities (counter definition ID 0) into an application, specify the following option on the link line:
-lbgl_perfctr.rts

This option should be listed first.


Table C-8   BGLPERFDESC table

Counter ID   Counter name: Counter description
0     BGL_FPU_ARITH_ADD_SUBTRACT: Add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)
1     BGL_FPU_ARITH_MULT_DIV: Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
2     BGL_FPU_ARITH_OEDIPUS_OP: Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
3     BGL_FPU_ARITH_TRINARY_OP: Trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
4     BGL_FPU_LDST_DBL_LD: Double loads, lfd, lfdx, lfdu, lfdux, lfsdx, lfsdux (double word loads, no single precision)
5     BGL_FPU_LDST_DBL_ST: Double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)
6     BGL_FPU_LDST_QUAD_LD: Quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
7     BGL_FPU_LDST_QUAD_ST: Quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
8     BGL_2NDFPU_ARITH_ADD_SUBTRACT: Second FPU add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)
9     BGL_2NDFPU_ARITH_MULT_DIV: Second FPU multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
10    BGL_2NDFPU_ARITH_OEDIPUS_OP: Second FPU Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
11    BGL_2NDFPU_ARITH_TRINARY_OP: Second FPU trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
12    BGL_2NDFPU_LDST_DBL_LD: Second FPU double loads, lfd, lfdx, lfdu, lfdux, lfsdx, lfsdux (double word loads, no single precision)
13    BGL_2NDFPU_LDST_DBL_ST: Second FPU double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)


14    BGL_2NDFPU_LDST_QUAD_LD: Second FPU quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
15    BGL_2NDFPU_LDST_QUAD_ST: Second FPU quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
16    BGL_UPC_L3_CACHE_HIT: Cache hit
17    BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR: Cache miss; data already on the way from DDR
18    BGL_UPC_L3_CACHE_MISS_DATA_WILL_BE_REQED_DDR: Cache miss; data will be requested from DDR
19    BGL_UPC_L3_EDRAM_ACCESS_CYCLE: EDRAM access cycle
20    BGL_UPC_L3_EDRAM_RFR_CYCLE: EDRAM refresh cycle
21    BGL_UPC_L3_LINE_STARTS_EVICT_LINE_NUM_PRESSURE: Line starts to evict due to line number pressure
22    BGL_UPC_L3_MISS_DIR_SET_DISBL: Miss, but this directory set is disabled
23    BGL_UPC_L3_MISS_NO_WAY_SET_AVAIL: Miss and no way in this set is available
24    BGL_UPC_L3_MISS_REQUIRING_CASTOUT: Miss requiring a castout
25    BGL_UPC_L3_MISS_REQUIRING_REFILL_NO_WR_ALLOC: Miss requiring a refill (no write allocation)
26    BGL_UPC_L3_MSHNDLR_TOOK_REQ: Miss handler took request
27    BGL_UPC_L3_MSHNDLR_TOOK_REQ_PLB_RDQ: Miss handler took request from PLB read queue
28    BGL_UPC_L3_MSHNDLR_TOOK_REQ_RDQ0: Miss handler took request from read queue 0
29    BGL_UPC_L3_MSHNDLR_TOOK_REQ_RDQ1: Miss handler took request from read queue 1
30    BGL_UPC_L3_MSHNDLR_TOOK_REQ_WRBUF: Miss handler took request from write buffer
31    BGL_UPC_L3_PAGE_CLOSE: Page close occurred
32    BGL_UPC_L3_PAGE_OPEN: Page open occurred
33    BGL_UPC_L3_PLB_WRQ_DEP_DBUF: PLB write queue deposits data into buffer
34    BGL_UPC_L3_PLB_WRQ_DEP_DBUF_HIT: PLB write queue deposits data into buffer (hit)
35    BGL_UPC_L3_PREF_REINS_PULL_OUT_NEXT_LINE: Prefetch reinserted to pull out next line
36    BGL_UPC_L3_PREF_REQ_ACC_BY_PREF_UNIT: Prefetch request accepted by prefetch unit
37    BGL_UPC_L3_RD_BURST_1024B_LINE_RD: Read burst (1024b line read) occurred
38    BGL_UPC_L3_RD_EDR__ALL_KINDS_OF_RD: Read from EDR occurred (all kinds of read)
39    BGL_UPC_L3_RD_MODIFY_WR_CYCLE_EDR: Read-modify-write cycle to EDR occurred
40    BGL_UPC_L3_REQ_TKN_CACHE_INHIB_RD_REQ: Request taken is a cache inhibited read request
41    BGL_UPC_L3_REQ_TKN_CACHE_INHIB_WR: Request taken is a cache inhibited write
42    BGL_UPC_L3_REQ_TKN_NEEDS_CASTOUT: Request taken needs castout


43    BGL_UPC_L3_REQ_TKN_NEEDS_REFILL: Request taken needs refill
44    BGL_UPC_L3_WRBUF_LINE_ALLOC: Write buffer line was allocated
45    BGL_UPC_L3_WRQ0_DEP_DBUF: Write queue 0 deposits data into buffer
46    BGL_UPC_L3_WRQ0_DEP_DBUF_HIT: Write queue 0 deposits data into buffer (hit)
47    BGL_UPC_L3_WRQ1_DEP_DBUF: Write queue 1 deposits data into buffer
48    BGL_UPC_L3_WRQ1_DEP_DBUF_HIT: Write queue 1 deposits data into buffer (hit)
49    BGL_UPC_L3_WR_EDRAM__INCLUDING_RMW: Write to EDRAM occurred (including RMW)
50    BGL_UPC_PU0_DCURD_1_RD_PEND: DCURD 1 read pending
51    BGL_UPC_PU0_DCURD_2_RD_PEND: DCURD 2 reads pending
52    BGL_UPC_PU0_DCURD_3_RD_PEND: DCURD 3 reads pending
53    BGL_UPC_PU0_DCURD_BLIND_REQ: DCURD BLIND request
54    BGL_UPC_PU0_DCURD_COHERENCY_STALL_WAR: DCURD coherency stall (WAR)
55    BGL_UPC_PU0_DCURD_L3_REQ: DCURD L3 request
56    BGL_UPC_PU0_DCURD_L3_REQ_PEND: DCURD L3 request pending
57    BGL_UPC_PU0_DCURD_LINK_REQ: DCURD LINK request
58    BGL_UPC_PU0_DCURD_LINK_REQ_PEND: DCURD LINK request pending
59    BGL_UPC_PU0_DCURD_LOCK_REQ: DCURD LOCK request
60    BGL_UPC_PU0_DCURD_LOCK_REQ_PEND: DCURD LOCK request pending
61    BGL_UPC_PU0_DCURD_PLB_REQ: DCURD PLB request
62    BGL_UPC_PU0_DCURD_PLB_REQ_PEND: DCURD PLB request pending
63    BGL_UPC_PU0_DCURD_RD_REQ: DCURD read request
64    BGL_UPC_PU0_DCURD_SRAM_REQ: DCURD SRAM request
65    BGL_UPC_PU0_DCURD_SRAM_REQ_PEND: DCURD SRAM request pending
66    BGL_UPC_PU0_DCURD_WAIT_L3: DCURD wait for L3
67    BGL_UPC_PU0_DCURD_WAIT_LINK: DCURD wait for LINK
68    BGL_UPC_PU0_DCURD_WAIT_LOCK: DCURD wait for LOCK
69    BGL_UPC_PU0_DCURD_WAIT_PLB: DCURD wait for PLB
70    BGL_UPC_PU0_DCURD_WAIT_SRAM: DCURD wait for SRAM
71    BGL_UPC_PU0_PREF_FILTER_HIT: Prefetch filter hit
72    BGL_UPC_PU0_PREF_PREF_PEND: Prefetch prefetch pending
73    BGL_UPC_PU0_PREF_REQ_VALID: Prefetch request valid
74    BGL_UPC_PU0_PREF_SELF_HIT: Prefetch self hit


75    BGL_UPC_PU0_PREF_SNOOP_HIT_OTHER: Prefetch snoop hit other
76    BGL_UPC_PU0_PREF_SNOOP_HIT_PLB: Prefetch snoop hit PLB
77    BGL_UPC_PU0_PREF_SNOOP_HIT_SAME: Prefetch snoop hit same
78    BGL_UPC_PU0_PREF_STREAM_HIT: Prefetch stream hit
79    BGL_UPC_PU1_DCURD_1_RD_PEND: DCURD 1 read pending
80    BGL_UPC_PU1_DCURD_2_RD_PEND: DCURD 2 reads pending
81    BGL_UPC_PU1_DCURD_3_RD_PEND: DCURD 3 reads pending
82    BGL_UPC_PU1_DCURD_BLIND_REQ: DCURD BLIND request
83    BGL_UPC_PU1_DCURD_COHERENCY_STALL_WAR: DCURD coherency stall (WAR)
84    BGL_UPC_PU1_DCURD_L3_REQ: DCURD L3 request
85    BGL_UPC_PU1_DCURD_L3_REQ_PEND: DCURD L3 request pending
86    BGL_UPC_PU1_DCURD_LINK_REQ: DCURD LINK request
87    BGL_UPC_PU1_DCURD_LINK_REQ_PEND: DCURD LINK request pending
88    BGL_UPC_PU1_DCURD_LOCK_REQ: DCURD LOCK request
89    BGL_UPC_PU1_DCURD_LOCK_REQ_PEND: DCURD LOCK request pending
90    BGL_UPC_PU1_DCURD_PLB_REQ: DCURD PLB request
91    BGL_UPC_PU1_DCURD_PLB_REQ_PEND: DCURD PLB request pending
92    BGL_UPC_PU1_DCURD_RD_REQ: DCURD read request
93    BGL_UPC_PU1_DCURD_SRAM_REQ: DCURD SRAM request
94    BGL_UPC_PU1_DCURD_SRAM_REQ_PEND: DCURD SRAM request pending
95    BGL_UPC_PU1_DCURD_WAIT_L3: DCURD wait for L3
96    BGL_UPC_PU1_DCURD_WAIT_LINK: DCURD wait for LINK
97    BGL_UPC_PU1_DCURD_WAIT_LOCK: DCURD wait for LOCK
98    BGL_UPC_PU1_DCURD_WAIT_PLB: DCURD wait for PLB
99    BGL_UPC_PU1_DCURD_WAIT_SRAM: DCURD wait for SRAM
100   BGL_UPC_PU1_PREF_FILTER_HIT: Prefetch filter hit
101   BGL_UPC_PU1_PREF_PREF_PEND: Prefetch prefetch pending
102   BGL_UPC_PU1_PREF_REQ_VALID: Prefetch request valid
103   BGL_UPC_PU1_PREF_SELF_HIT: Prefetch self hit
104   BGL_UPC_PU1_PREF_SNOOP_HIT_OTHER: Prefetch snoop hit other
105   BGL_UPC_PU1_PREF_SNOOP_HIT_PLB: Prefetch snoop hit PLB
106   BGL_UPC_PU1_PREF_SNOOP_HIT_SAME: Prefetch snoop hit same


107   BGL_UPC_PU1_PREF_STREAM_HIT: Prefetch stream hit
108   BGL_UPC_TI_TESTINT_ERR_EVENT: Testint error event
109   BGL_UPC_TR_ARB_CH2_VC0_HAVE: Arbiter ch2_vc0_have
110   BGL_UPC_TR_ARB_CH1_VC0_HAVE: Arbiter ch1_vc0_have
111   BGL_UPC_TR_ARB_CH0_VC0_HAVE: Arbiter ch0_vc0_have
112   BGL_UPC_TR_ARB_INJ_VC0_HAVE: Arbiter inj_vc0_have
113   BGL_UPC_TR_ARB_CH2_VC1_HAVE: Arbiter ch2_vc1_have
114   BGL_UPC_TR_ARB_CH1_VC1_HAVE: Arbiter ch1_vc1_have
115   BGL_UPC_TR_ARB_CH0_VC1_HAVE: Arbiter ch0_vc1_have
116   BGL_UPC_TR_ARB_INJ_VC1_HAVE: Arbiter inj_vc1_have
117   BGL_UPC_TR_ARB_CORE_CH2_VC0_MATURE: Arbiter_core ch2_vc0_mature
118   BGL_UPC_TR_ARB_CORE_CH1_VC0_MATURE: Arbiter_core ch1_vc0_mature
119   BGL_UPC_TR_ARB_CORE_CH0_VC0_MATURE: Arbiter_core ch0_vc0_mature
120   BGL_UPC_TR_ARB_CORE_INJ_VC0_MATURE: Arbiter_core inj_vc0_mature
121   BGL_UPC_TR_ARB_CORE_CH2_VC1_MATURE: Arbiter_core ch2_vc1_mature
122   BGL_UPC_TR_ARB_CORE_CH1_VC1_MATURE: Arbiter_core ch1_vc1_mature
123   BGL_UPC_TR_ARB_CORE_CH0_VC1_MATURE: Arbiter_core ch0_vc1_mature
124   BGL_UPC_TR_ARB_CORE_INJ_VC1_MATURE: Arbiter_core inj_vc1_mature
125   BGL_UPC_TR_ARB_CORE_GREEDY_MODE: Arbiter_core greedy_mode
126   BGL_UPC_TR_ARB_CORE_REQ_PEND: Arbiter_core requests pending
127   BGL_UPC_TR_ARB_CORE_REQ_WAITING_RDY_GO: Arbiter_core requests waiting (ready to go)
128   BGL_UPC_TR_ARB_CLASS0_WINS: Arbiter class 0 wins
129   BGL_UPC_TR_ARB_CLASS1_WINS: Arbiter class 1 wins
130   BGL_UPC_TR_ARB_CLASS2_WINS: Arbiter class 2 wins
131   BGL_UPC_TR_ARB_CLASS3_WINS: Arbiter class 3 wins
132   BGL_UPC_TR_ARB_CLASS4_WINS: Arbiter class 4 wins
133   BGL_UPC_TR_ARB_CLASS5_WINS: Arbiter class 5 wins
134   BGL_UPC_TR_ARB_CLASS6_WINS: Arbiter class 6 wins
135   BGL_UPC_TR_ARB_CLASS7_WINS: Arbiter class 7 wins
136   BGL_UPC_TR_ARB_CLASS8_WINS: Arbiter class 8 wins
137   BGL_UPC_TR_ARB_CLASS9_WINS: Arbiter class 9 wins
138   BGL_UPC_TR_ARB_CLASS10_WINS: Arbiter class 10 wins


139   BGL_UPC_TR_ARB_CLASS11_WINS: Arbiter class 11 wins
140   BGL_UPC_TR_ARB_CLASS12_WINS: Arbiter class 12 wins
141   BGL_UPC_TR_ARB_CLASS13_WINS: Arbiter class 13 wins
142   BGL_UPC_TR_ARB_CLASS14_WINS: Arbiter class 14 wins
143   BGL_UPC_TR_ARB_CLASS15_WINS: Arbiter class 15 wins
144   BGL_UPC_TR_ARB_SNDR2_BUSY: Arbiter sender 2 busy
145   BGL_UPC_TR_ARB_SNDR1_BUSY: Arbiter sender 1 busy
146   BGL_UPC_TR_ARB_SNDR0_BUSY: Arbiter sender 0 busy
147   BGL_UPC_TR_ARB_LOCAL_CLIENT_BUSY_REC: Arbiter local client busy (reception)
148   BGL_UPC_TR_ARB_RCV2_BUSY: Arbiter receiver 2 busy
149   BGL_UPC_TR_ARB_RCV1_BUSY: Arbiter receiver 1 busy
150   BGL_UPC_TR_ARB_RCV0_BUSY: Arbiter receiver 0 busy
151   BGL_UPC_TR_ARB_LOCAL_CLIENT_BUSY_INJ: Arbiter local client busy (injection)
152   BGL_UPC_TR_ARB_ALU_BUSY: Arbiter alu busy
153   BGL_UPC_TR_ARB_RCV2_ABORT: Arbiter receiver 2 abort
154   BGL_UPC_TR_ARB_RCV1_ABORT: Arbiter receiver 1 abort
155   BGL_UPC_TR_ARB_RCV0_ABORT: Arbiter receiver 0 abort
156   BGL_UPC_TR_ARB_LOCAL_CLIENT_ABORT: Arbiter local client abort
157   BGL_UPC_TR_ARB_RCV2_PKT_TKN: Arbiter receiver 2 packet taken
158   BGL_UPC_TR_ARB_RCV1_PKT_TKN: Arbiter receiver 1 packet taken
159   BGL_UPC_TR_ARB_RCV0_PKT_TKN: Arbiter receiver 0 packet taken
160   BGL_UPC_TR_ARB_LOCAL_CLIENT_PKT_TKN: Arbiter local client packet taken
161   BGL_UPC_TR_RCV_0_VC0_DPKT_RCV: Receiver 0 vc0 data packet received
162   BGL_UPC_TR_RCV_0_VC1_DPKT_RCV: Receiver 0 vc1 data packet received
163   BGL_UPC_TR_RCV_0_VC0_EMPTY_PKT: Receiver 0 vc0 empty packet
164   BGL_UPC_TR_RCV_0_VC1_EMPTY_PKT: Receiver 0 vc1 empty packet
165   BGL_UPC_TR_RCV_0_IDLPKT: Receiver 0 IDLE packet
166   BGL_UPC_TR_RCV_0_KNOWN_BAD_PKT_MARKER: Receiver 0 known-bad-packet marker
167   BGL_UPC_TR_RCV_0_VC0_CUT_THROUGH: Receiver 0 vc0 cut-through
168   BGL_UPC_TR_RCV_0_VC1_CUT_THROUGH: Receiver 0 vc1 cut-through
169   BGL_UPC_TR_RCV_0_VC0_FULL: Receiver 0 vc0 full
170   BGL_UPC_TR_RCV_0_VC1_FULL: Receiver 0 vc1 full


171   BGL_UPC_TR_RCV_0_HDR_PARITY_ERR: Receiver 0 header parity error
172   BGL_UPC_TR_RCV_0_CRC_ERR: Receiver 0 CRC error
173   BGL_UPC_TR_RCV_0_UNEXPCT_HDR_ERR: Receiver 0 unexpected header error
174   BGL_UPC_TR_RCV_0_RESYNCH_MODE_AFTER_ERR: Receiver 0 resynch-mode (after error)
175   BGL_UPC_TR_RCV_0_SRAM_ERR_CORR: Receiver 0 SRAM error corrected
176   BGL_UPC_TR_RCV_1_VC0_DPKT_RCV: Receiver 1 vc0 data packet received
177   BGL_UPC_TR_RCV_1_VC1_DPKT_RCV: Receiver 1 vc1 data packet received
178   BGL_UPC_TR_RCV_1_VC0_EMPTY_PKT: Receiver 1 vc0 empty packet
179   BGL_UPC_TR_RCV_1_VC1_EMPTY_PKT: Receiver 1 vc1 empty packet
180   BGL_UPC_TR_RCV_1_IDLPKT: Receiver 1 IDLE packet
181   BGL_UPC_TR_RCV_1_KNOWN_BAD_PKT_MARKER: Receiver 1 known-bad-packet marker
182   BGL_UPC_TR_RCV_1_VC0_CUT_THROUGH: Receiver 1 vc0 cut-through
183   BGL_UPC_TR_RCV_1_VC1_CUT_THROUGH: Receiver 1 vc1 cut-through
184   BGL_UPC_TR_RCV_1_VC0_FULL: Receiver 1 vc0 full
185   BGL_UPC_TR_RCV_1_VC1_FULL: Receiver 1 vc1 full
186   BGL_UPC_TR_RCV_1_HDR_PARITY_ERR: Receiver 1 header parity error
187   BGL_UPC_TR_RCV_1_CRC_ERR: Receiver 1 CRC error
188   BGL_UPC_TR_RCV_1_UNEXPCT_HDR_ERR: Receiver 1 unexpected header error
189   BGL_UPC_TR_RCV_1_RESYNCH_MODE_AFTER_ERR: Receiver 1 resynch-mode (after error)
190   BGL_UPC_TR_RCV_1_SRAM_ERR_CORR: Receiver 1 SRAM error corrected
191   BGL_UPC_TR_RCV_2_VC0_DPKT_RCV: Receiver 2 vc0 data packet received
192   BGL_UPC_TR_RCV_2_VC1_DPKT_RCV: Receiver 2 vc1 data packet received
193   BGL_UPC_TR_RCV_2_VC0_EMPTY_PKT: Receiver 2 vc0 empty packet
194   BGL_UPC_TR_RCV_2_VC1_EMPTY_PKT: Receiver 2 vc1 empty packet
195   BGL_UPC_TR_RCV_2_IDLPKT: Receiver 2 IDLE packet
196   BGL_UPC_TR_RCV_2_KNOWN_BAD_PKT_MARKER: Receiver 2 known-bad-packet marker
197   BGL_UPC_TR_RCV_2_VC0_CUT_THROUGH: Receiver 2 vc0 cut-through
198   BGL_UPC_TR_RCV_2_VC1_CUT_THROUGH: Receiver 2 vc1 cut-through
199   BGL_UPC_TR_RCV_2_VC0_FULL: Receiver 2 vc0 full
200   BGL_UPC_TR_RCV_2_VC1_FULL: Receiver 2 vc1 full
201   BGL_UPC_TR_RCV_2_HDR_PARITY_ERR: Receiver 2 header parity error


202   BGL_UPC_TR_RCV_2_CRC_ERR: Receiver 2 CRC error
203   BGL_UPC_TR_RCV_2_UNEXPCT_HDR_ERR: Receiver 2 unexpected header error
204   BGL_UPC_TR_RCV_2_RESYNCH_MODE_AFTER_ERR: Receiver 2 resynch-mode (after error)
205   BGL_UPC_TR_RCV_2_SRAM_ERR_CORR: Receiver 2 SRAM error corrected
206   BGL_UPC_TR_SNDR_0_VC0_EMPTY: Sender 0 vc0 empty
207   BGL_UPC_TR_SNDR_0_VC1_EMPTY: Sender 0 vc1 empty
208   BGL_UPC_TR_SNDR_0_VC0_CUT_THROUGH: Sender 0 vc0 cut-through
209   BGL_UPC_TR_SNDR_0_VC1_CUT_THROUGH: Sender 0 vc1 cut-through
210   BGL_UPC_TR_SNDR_0_VC0_PKT_SENT_TOTAL: Sender 0 vc0 packet sent (total)
211   BGL_UPC_TR_SNDR_0_VC1_PKT_SENT_TOTAL: Sender 0 vc1 packet sent (total)
212   BGL_UPC_TR_SNDR_0_VC0_DPKTS_SENT: Sender 0 vc0 DATA packets sent
213   BGL_UPC_TR_SNDR_0_VC1_DPKTS_SENT: Sender 0 vc1 DATA packets sent
214   BGL_UPC_TR_SNDR_0_IDLPKTS_SENT: Sender 0 IDLE packets sent
215   BGL_UPC_TR_SNDR_0_RESEND_ATTS: Sender 0 resend attempts
216   BGL_UPC_TR_SNDR_0_SRAM_ERR_CORR: Sender 0 SRAM error corrected
217   BGL_UPC_TR_SNDR_1_VC0_EMPTY: Sender 1 vc0 empty
218   BGL_UPC_TR_SNDR_1_VC1_EMPTY: Sender 1 vc1 empty
219   BGL_UPC_TR_SNDR_1_VC0_CUT_THROUGH: Sender 1 vc0 cut-through
220   BGL_UPC_TR_SNDR_1_VC1_CUT_THROUGH: Sender 1 vc1 cut-through
221   BGL_UPC_TR_SNDR_1_VC0_PKT_SENT_TOTAL: Sender 1 vc0 packet sent (total)
222   BGL_UPC_TR_SNDR_1_VC1_PKT_SENT_TOTAL: Sender 1 vc1 packet sent (total)
223   BGL_UPC_TR_SNDR_1_VC0_DPKTS_SENT: Sender 1 vc0 DATA packets sent
224   BGL_UPC_TR_SNDR_1_VC1_DPKTS_SENT: Sender 1 vc1 DATA packets sent
225   BGL_UPC_TR_SNDR_1_IDLPKTS_SENT: Sender 1 IDLE packets sent
226   BGL_UPC_TR_SNDR_1_RESEND_ATTS: Sender 1 resend attempts
227   BGL_UPC_TR_SNDR_1_SRAM_ERR_CORR: Sender 1 SRAM error corrected
228   BGL_UPC_TR_SNDR_2_VC0_EMPTY: Sender 2 vc0 empty
229   BGL_UPC_TR_SNDR_2_VC1_EMPTY: Sender 2 vc1 empty
230   BGL_UPC_TR_SNDR_2_VC0_CUT_THROUGH: Sender 2 vc0 cut-through
231   BGL_UPC_TR_SNDR_2_VC1_CUT_THROUGH: Sender 2 vc1 cut-through
232   BGL_UPC_TR_SNDR_2_VC0_PKT_SENT_TOTAL: Sender 2 vc0 packet sent (total)
233   BGL_UPC_TR_SNDR_2_VC1_PKT_SENT_TOTAL: Sender 2 vc1 packet sent (total)


234   BGL_UPC_TR_SNDR_2_VC0_DPKTS_SENT: Sender 2 vc0 DATA packets sent
235   BGL_UPC_TR_SNDR_2_VC1_DPKTS_SENT: Sender 2 vc1 DATA packets sent
236   BGL_UPC_TR_SNDR_2_IDLPKTS_SENT: Sender 2 IDLE packets sent
237   BGL_UPC_TR_SNDR_2_RESEND_ATTS: Sender 2 resend attempts
238   BGL_UPC_TR_SNDR_2_SRAM_ERR_CORR: Sender 2 SRAM error corrected
239   BGL_UPC_TR_INJ_VC0_HDR_ADDED: Injection vc0 header added
240   BGL_UPC_TR_INJ_VC1_HDR_ADDED: Injection vc1 header added
241   BGL_UPC_TR_INJ_VC0_PYLD_ADDED: Injection vc0 payload added
242   BGL_UPC_TR_INJ_VC1_PYLD_ADDED: Injection vc1 payload added
243   BGL_UPC_TR_INJ_VC0_PKT_TKN: Injection vc0 packet taken
244   BGL_UPC_TR_INJ_VC1_PKT_TKN: Injection vc1 packet taken
245   BGL_UPC_TR_INJ_SRAM_ERR_CORR: Injection SRAM error corrected
246   BGL_UPC_TR_REC_VC0_PKT_ADDED: Reception vc0 packet added
247   BGL_UPC_TR_REC_VC1_PKT_ADDED: Reception vc1 packet added
248   BGL_UPC_TR_REC_VC0_HDR_TKN: Reception vc0 header taken
249   BGL_UPC_TR_REC_VC1_HDR_TKN: Reception vc1 header taken
250   BGL_UPC_TR_REC_VC0_PYLD_TKN: Reception vc0 payload taken
251   BGL_UPC_TR_REC_VC1_PYLD_TKN: Reception vc1 payload taken
252   BGL_UPC_TR_REC_VC0_PKT_DISC: Reception vc0 packet discarded
253   BGL_UPC_TR_REC_VC1_PKT_DISC: Reception vc1 packet discarded
254   BGL_UPC_TR_REC_SRAM_ERR_CORR: Reception SRAM error corrected
255   BGL_UPC_TS_XM_32B_CHUNKS: XM 32 B chunks
256   BGL_UPC_TS_XM_ACKS: XM acks
257   BGL_UPC_TS_XM_LINK_AVAIL_NO_VCBN_TOKENS: XM link available; no vcbn tokens
258   BGL_UPC_TS_XM_LINK_AVAIL_NO_VCBP_TOKENS: XM link available; no vcbp tokens
259   BGL_UPC_TS_XM_LINK_AVAIL_NO_VCD0_VCD1_TOKENS: XM link available; no vcd0 vcd1 tokens
260   BGL_UPC_TS_XM_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS: XM link available; no vcd0 vcd; vcbn tokens
261   BGL_UPC_TS_XM_PKTS: XM packets
262   BGL_UPC_TS_XM_TOKEN_ACKS: XM token/acks
263   BGL_UPC_TS_XM_VCBN_CHUNKS: XM vcbn chunks
264   BGL_UPC_TS_XM_VCBP_CHUNKS: XM vcbp chunks


265   BGL_UPC_TS_XM_VCD0_CHUNKS: XM vcd0 chunks
266   BGL_UPC_TS_XM_VCD1_CHUNKS: XM vcd1 chunks
267   BGL_UPC_TS_XP_32B_CHUNKS: XP 32 B chunks
268   BGL_UPC_TS_XP_ACKS: XP acks
269   BGL_UPC_TS_XP_LINK_AVAIL_NO_VCBN_TOKENS: XP link available; no vcbn tokens
270   BGL_UPC_TS_XP_LINK_AVAIL_NO_VCBP_TOKENS: XP link available; no vcbp tokens
271   BGL_UPC_TS_XP_LINK_AVAIL_NO_VCD0_VCD1_TOKENS: XP link available; no vcd0 vcd1 tokens
272   BGL_UPC_TS_XP_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS: XP link available; no vcd0 vcd; vcbn tokens
273   BGL_UPC_TS_XP_PKTS: XP packets
274   BGL_UPC_TS_XP_TOKEN_ACKS: XP token/acks
275   BGL_UPC_TS_XP_VCBN_CHUNKS: XP vcbn chunks
276   BGL_UPC_TS_XP_VCBP_CHUNKS: XP vcbp chunks
277   BGL_UPC_TS_XP_VCD0_CHUNKS: XP vcd0 chunks
278   BGL_UPC_TS_XP_VCD1_CHUNKS: XP vcd1 chunks
279   BGL_UPC_TS_YM_32B_CHUNKS: YM 32 B chunks
280   BGL_UPC_TS_YM_ACKS: YM acks
281   BGL_UPC_TS_YM_LINK_AVAIL_NO_VCBN_TOKENS: YM link available; no vcbn tokens
282   BGL_UPC_TS_YM_LINK_AVAIL_NO_VCBP_TOKENS: YM link available; no vcbp tokens
283   BGL_UPC_TS_YM_LINK_AVAIL_NO_VCD0_VCD1_TOKENS: YM link available; no vcd0 vcd1 tokens
284   BGL_UPC_TS_YM_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS: YM link available; no vcd0 vcd; vcbn tokens
285   BGL_UPC_TS_YM_PKTS: YM packets
286   BGL_UPC_TS_YM_TOKEN_ACKS: YM token/acks
287   BGL_UPC_TS_YM_VCBN_CHUNKS: YM vcbn chunks
288   BGL_UPC_TS_YM_VCBP_CHUNKS: YM vcbp chunks
289   BGL_UPC_TS_YM_VCD0_CHUNKS: YM vcd0 chunks
290   BGL_UPC_TS_YM_VCD1_CHUNKS: YM vcd1 chunks
291   BGL_UPC_TS_YP_32B_CHUNKS: YP 32 B chunks
292   BGL_UPC_TS_YP_ACKS: YP acks


293   BGL_UPC_TS_YP_LINK_AVAIL_NO_VCBN_TOKENS: YP link available; no vcbn tokens
294   BGL_UPC_TS_YP_LINK_AVAIL_NO_VCBP_TOKENS: YP link available; no vcbp tokens
295   BGL_UPC_TS_YP_LINK_AVAIL_NO_VCD0_VCD1_TOKENS: YP link available; no vcd0 vcd1 tokens
296   BGL_UPC_TS_YP_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS: YP link available; no vcd0 vcd; vcbn tokens
297   BGL_UPC_TS_YP_PKTS: YP packets
298   BGL_UPC_TS_YP_TOKEN_ACKS: YP token/acks
299   BGL_UPC_TS_YP_VCBN_CHUNKS: YP vcbn chunks
300   BGL_UPC_TS_YP_VCBP_CHUNKS: YP vcbp chunks
301   BGL_UPC_TS_YP_VCD0_CHUNKS: YP vcd0 chunks
302   BGL_UPC_TS_YP_VCD1_CHUNKS: YP vcd1 chunks
303   BGL_UPC_TS_ZM_32B_CHUNKS: ZM 32 B chunks
304   BGL_UPC_TS_ZM_ACKS: ZM acks
305   BGL_UPC_TS_ZM_LINK_AVAIL_NO_VCBN_TOKENS: ZM link available; no vcbn tokens
306   BGL_UPC_TS_ZM_LINK_AVAIL_NO_VCBP_TOKENS: ZM link available; no vcbp tokens
307   BGL_UPC_TS_ZM_LINK_AVAIL_NO_VCD0_VCD1_TOKENS: ZM link available; no vcd0 vcd1 tokens
308   BGL_UPC_TS_ZM_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS: ZM link available; no vcd0 vcd; vcbn tokens
309   BGL_UPC_TS_ZM_PKTS: ZM packets
310   BGL_UPC_TS_ZM_TOKEN_ACKS: ZM token/acks
311   BGL_UPC_TS_ZM_VCBN_CHUNKS: ZM vcbn chunks
312   BGL_UPC_TS_ZM_VCBP_CHUNKS: ZM vcbp chunks
313   BGL_UPC_TS_ZM_VCD0_CHUNKS: ZM vcd0 chunks
314   BGL_UPC_TS_ZM_VCD1_CHUNKS: ZM vcd1 chunks
315   BGL_UPC_TS_ZP_32B_CHUNKS: ZP 32 B chunks
316   BGL_UPC_TS_ZP_ACKS: ZP acks
317   BGL_UPC_TS_ZP_LINK_AVAIL_NO_VCBN_TOKENS: ZP link available; no vcbn tokens
318   BGL_UPC_TS_ZP_LINK_AVAIL_NO_VCBP_TOKENS: ZP link available; no vcbp tokens
319   BGL_UPC_TS_ZP_LINK_AVAIL_NO_VCD0_VCD1_TOKENS: ZP link available; no vcd0 vcd1 tokens


320   BGL_UPC_TS_ZP_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS: ZP link available; no vcd0 vcd; vcbn tokens
321   BGL_UPC_TS_ZP_PKTS: ZP packets
322   BGL_UPC_TS_ZP_TOKEN_ACKS: ZP token/acks
323   BGL_UPC_TS_ZP_VCBN_CHUNKS: ZP vcbn chunks
324   BGL_UPC_TS_ZP_VCBP_CHUNKS: ZP vcbp chunks
325   BGL_UPC_TS_ZP_VCD0_CHUNKS: ZP vcd0 chunks
326   BGL_UPC_TS_ZP_VCD1_CHUNKS: ZP vcd1 chunks
327   BGL_PERFCTR_NULL_EVENT: Null event
10000 BGL_FPU_DERIVED_COUNTER_1: Round robins FPU counters 0, 4, 8, 12
10001 BGL_FPU_DERIVED_COUNTER_2: Round robins FPU counters 1, 5, 9, 13
10002 BGL_FPU_DERIVED_COUNTER_3: Round robins FPU counters 2, 6, 10, 14
10003 BGL_FPU_DERIVED_COUNTER_4: Round robins FPU counters 3, 7, 11, 15

BGLPERFDEF and BGLPERFDESC table join


The tables in this section describe the possible sets of counter definitions that are available. These sets of counter definitions are identified by counter definition ID. The counter definition ID to be used for a job is provided by the BGL_PERFMON environment variable. If an application has the performance library linked (that is, it is updating the performance counters), but no counter definition ID is specified for the job with the BGL_PERFMON environment variable, then by default, counter definition ID 1000 is used for the job. The columns that are listed result in a join of tables BGLPERFDEF and BGLPERFDESC on column COUNTER_ID.
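The default-selection rule just described can be expressed compactly. This is a hypothetical helper for scripts that post-process perfmon data, not part of the perfmon tooling itself:

```python
import os

def counter_definition_id(env=os.environ) -> int:
    """Counter definition ID for a job: taken from the BGL_PERFMON
    environment variable, defaulting to 1000 when it is not set."""
    return int(env.get("BGL_PERFMON", "1000"))

print(counter_definition_id({}))                     # 1000
print(counter_definition_id({"BGL_PERFMON": "0"}))   # 0
```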
Table C-9   BGLPERFDEF/BGLPERFDESC table join: Counter definition ID 0

Counter ID   Counter name: Counter description
0     BGL_FPU_ARITH_ADD_SUBTRACT: Add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)
4     BGL_FPU_LDST_DBL_LD: Double loads, lfd, lfdx, lfdu, lfdux, lfsdx, lfsdux (double word loads, no single precision)
8     BGL_2NDFPU_ARITH_ADD_SUBTRACT: Second FPU add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)
12    BGL_2NDFPU_LDST_DBL_LD: Second FPU double loads, lfd, lfdx, lfdu, lfdux, lfsdx, lfsdux (double word loads, no single precision)
16    BGL_UPC_L3_CACHE_HIT: Cache hit
17    BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR: Cache miss; data already on the way from DDR
18    BGL_UPC_L3_CACHE_MISS_DATA_WILL_BE_REQED_DDR: Cache miss; data will be requested from DDR
19    BGL_UPC_L3_EDRAM_ACCESS_CYCLE: EDRAM access cycle
20    BGL_UPC_L3_EDRAM_RFR_CYCLE: EDRAM refresh cycle
21    BGL_UPC_L3_LINE_STARTS_EVICT_LINE_NUM_PRESSURE: Line starts to evict due to line number pressure
22    BGL_UPC_L3_MISS_DIR_SET_DISBL: Miss, but this directory set is disabled
23    BGL_UPC_L3_MISS_NO_WAY_SET_AVAIL: Miss and no way in this set is available
24    BGL_UPC_L3_MISS_REQUIRING_CASTOUT: Miss requiring a castout
25    BGL_UPC_L3_MISS_REQUIRING_REFILL_NO_WR_ALLOC: Miss requiring a refill (no write allocation)
26    BGL_UPC_L3_MSHNDLR_TOOK_REQ: Miss handler took request
27    BGL_UPC_L3_MSHNDLR_TOOK_REQ_PLB_RDQ: Miss handler took request from PLB read queue
28    BGL_UPC_L3_MSHNDLR_TOOK_REQ_RDQ0: Miss handler took request from read queue 0
29    BGL_UPC_L3_MSHNDLR_TOOK_REQ_RDQ1: Miss handler took request from read queue 1
30    BGL_UPC_L3_MSHNDLR_TOOK_REQ_WRBUF: Miss handler took request from write buffer
31    BGL_UPC_L3_PAGE_CLOSE: Page close occurred
32    BGL_UPC_L3_PAGE_OPEN: Page open occurred
33    BGL_UPC_L3_PLB_WRQ_DEP_DBUF: PLB write queue deposits data into buffer
34    BGL_UPC_L3_PLB_WRQ_DEP_DBUF_HIT: PLB write queue deposits data into buffer (hit)
35    BGL_UPC_L3_PREF_REINS_PULL_OUT_NEXT_LINE: Prefetch reinserted to pull out next line
36    BGL_UPC_L3_PREF_REQ_ACC_BY_PREF_UNIT: Prefetch request accepted by prefetch unit
37    BGL_UPC_L3_RD_BURST_1024B_LINE_RD: Read burst (1024b line read) occurred
38    BGL_UPC_L3_RD_EDR__ALL_KINDS_OF_RD: Read from EDR occurred (all kinds of read)
39    BGL_UPC_L3_RD_MODIFY_WR_CYCLE_EDR: Read-modify-write cycle to EDR occurred
40    BGL_UPC_L3_REQ_TKN_CACHE_INHIB_RD_REQ: Request taken is a cache inhibited read request
41    BGL_UPC_L3_REQ_TKN_CACHE_INHIB_WR: Request taken is a cache inhibited write
42    BGL_UPC_L3_REQ_TKN_NEEDS_CASTOUT: Request taken needs castout
43    BGL_UPC_L3_REQ_TKN_NEEDS_REFILL: Request taken needs refill
261   BGL_UPC_TS_XM_PKTS: XM packets
273   BGL_UPC_TS_XP_PKTS: XP packets

98

Blue Gene/L: Performance Analysis Tools

Counter definition ID 0 (continued)

Counter ID | Counter name | Counter description
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event
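The counter definition tables in this appendix all share the same shape: a counter definition ID groups a set of counter IDs, each with a name and description. A minimal sketch of how such rows could be queried is shown below; it uses an in-memory SQLite table because the real BGLPERFDEF table lives in the service node's DB2 database, and the column names here (counter_def_id, counter_id, counter_name, counter_desc) are assumed for illustration only.

```python
import sqlite3

# Hypothetical stand-in for the DB2 BGLPERFDEF table; column names assumed.
def load_demo_table():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE BGLPERFDEF ("
        "counter_def_id INTEGER, counter_id INTEGER,"
        "counter_name TEXT, counter_desc TEXT)"
    )
    # A few rows copied from the definition ID 0 and ID 1 tables above.
    conn.executemany(
        "INSERT INTO BGLPERFDEF VALUES (?, ?, ?, ?)",
        [
            (0, 261, "BGL_UPC_TS_XM_PKTS", "XM packets"),
            (0, 273, "BGL_UPC_TS_XP_PKTS", "XP packets"),
            (1, 1, "BGL_FPU_ARITH_MULT_DIV", "Multiplication and divisions"),
        ],
    )
    return conn

def counters_for_definition(conn, def_id):
    """Return (counter_id, counter_name) rows for one counter definition set."""
    cur = conn.execute(
        "SELECT counter_id, counter_name FROM BGLPERFDEF "
        "WHERE counter_def_id = ? ORDER BY counter_id",
        (def_id,),
    )
    return cur.fetchall()

conn = load_demo_table()
for cid, name in counters_for_definition(conn, 0):
    print(cid, name)
```

Selecting on counter_def_id is the key idea: a job run under the External Performance Monitor records which definition set was active, and that ID resolves to the rows listed in these tables.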

Table C-10 BGLPERFDEF table: Counter definition ID 1

Counter ID | Counter name | Counter description
1 | BGL_FPU_ARITH_MULT_DIV | Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
5 | BGL_FPU_LDST_DBL_ST | Double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)
9 | BGL_2NDFPU_ARITH_MULT_DIV | Second FPU multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
13 | BGL_2NDFPU_LDST_DBL_ST | Second FPU double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)
44 | BGL_UPC_L3_WRBUF_LINE_ALLOC | Write buffer line was allocated
45 | BGL_UPC_L3_WRQ0_DEP_DBUF | Write queue 0 deposits data into buffer
46 | BGL_UPC_L3_WRQ0_DEP_DBUF_HIT | Write queue 0 deposits data into buffer (hit)
49 | BGL_UPC_L3_WR_EDRAM__INCLUDING_RMW | Write to EDRAM occurred (including RMW)
50 | BGL_UPC_PU0_DCURD_1_RD_PEND | DCURD 1 read pending
51 | BGL_UPC_PU0_DCURD_2_RD_PEND | DCURD 2 reads pending
52 | BGL_UPC_PU0_DCURD_3_RD_PEND | DCURD 3 reads pending
53 | BGL_UPC_PU0_DCURD_BLIND_REQ | DCURD BLIND request
54 | BGL_UPC_PU0_DCURD_COHERENCY_STALL_WAR | DCURD coherency stall (WAR)
55 | BGL_UPC_PU0_DCURD_L3_REQ | DCURD L3 request
56 | BGL_UPC_PU0_DCURD_L3_REQ_PEND | DCURD L3 request pending
57 | BGL_UPC_PU0_DCURD_LINK_REQ | DCURD LINK request
58 | BGL_UPC_PU0_DCURD_LINK_REQ_PEND | DCURD LINK request pending
59 | BGL_UPC_PU0_DCURD_LOCK_REQ | DCURD LOCK request
60 | BGL_UPC_PU0_DCURD_LOCK_REQ_PEND | DCURD LOCK request pending

Appendix C. Perfmon database table specifications

99

Counter definition ID 1 (continued)

Counter ID | Counter name | Counter description
63 | BGL_UPC_PU0_DCURD_RD_REQ | DCURD read request
64 | BGL_UPC_PU0_DCURD_SRAM_REQ | DCURD SRAM request
65 | BGL_UPC_PU0_DCURD_SRAM_REQ_PEND | DCURD SRAM request pending
66 | BGL_UPC_PU0_DCURD_WAIT_L3 | DCURD wait for L3
67 | BGL_UPC_PU0_DCURD_WAIT_LINK | DCURD wait for LINK
68 | BGL_UPC_PU0_DCURD_WAIT_LOCK | DCURD wait for LOCK
71 | BGL_UPC_PU0_PREF_FILTER_HIT | Prefetch filter hit
72 | BGL_UPC_PU0_PREF_PREF_PEND | Prefetch prefetch pending
73 | BGL_UPC_PU0_PREF_REQ_VALID | Prefetch request valid
74 | BGL_UPC_PU0_PREF_SELF_HIT | Prefetch self hit
75 | BGL_UPC_PU0_PREF_SNOOP_HIT_OTHER | Prefetch snoop hit other
76 | BGL_UPC_PU0_PREF_SNOOP_HIT_PLB | Prefetch snoop hit PLB
77 | BGL_UPC_PU0_PREF_SNOOP_HIT_SAME | Prefetch snoop hit same
78 | BGL_UPC_PU0_PREF_STREAM_HIT | Prefetch stream hit
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event


Table C-11 BGLPERFDEF table: Counter definition ID 2

Counter ID | Counter name | Counter description
2 | BGL_FPU_ARITH_OEDIPUS_OP | Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
6 | BGL_FPU_LDST_QUAD_LD | Quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
10 | BGL_2NDFPU_ARITH_OEDIPUS_OP | Second FPU Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
14 | BGL_2NDFPU_LDST_QUAD_LD | Second FPU quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
47 | BGL_UPC_L3_WRQ1_DEP_DBUF | Write queue 1 deposits data into buffer
48 | BGL_UPC_L3_WRQ1_DEP_DBUF_HIT | Write queue 1 deposits data into buffer (hit)
61 | BGL_UPC_PU0_DCURD_PLB_REQ | DCURD PLB request
62 | BGL_UPC_PU0_DCURD_PLB_REQ_PEND | DCURD PLB request pending
69 | BGL_UPC_PU0_DCURD_WAIT_PLB | DCURD wait for PLB
70 | BGL_UPC_PU0_DCURD_WAIT_SRAM | DCURD wait for SRAM
90 | BGL_UPC_PU1_DCURD_PLB_REQ | DCURD PLB request
91 | BGL_UPC_PU1_DCURD_PLB_REQ_PEND | DCURD PLB request pending
94 | BGL_UPC_PU1_DCURD_SRAM_REQ_PEND | DCURD SRAM request pending
98 | BGL_UPC_PU1_DCURD_WAIT_PLB | DCURD wait for PLB
99 | BGL_UPC_PU1_DCURD_WAIT_SRAM | DCURD wait for SRAM
112 | BGL_UPC_TR_ARB_INJ_VC0_HAVE | Arbiter inj_vc0_have
117 | BGL_UPC_TR_ARB_CORE_CH2_VC0_MATURE | Arbiter_core ch2_vc0_mature
168 | BGL_UPC_TR_RCV_0_VC1_CUT_THROUGH | Receiver 0 vc1 cut-through
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event


Table C-12 BGLPERFDEF table: Counter definition ID 3

Counter ID | Counter name | Counter description
3 | BGL_FPU_ARITH_TRINARY_OP | Trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
7 | BGL_FPU_LDST_QUAD_ST | Quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
11 | BGL_2NDFPU_ARITH_TRINARY_OP | Second FPU trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
15 | BGL_2NDFPU_LDST_QUAD_ST | Second FPU quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
79 | BGL_UPC_PU1_DCURD_1_RD_PEND | DCURD 1 read pending
80 | BGL_UPC_PU1_DCURD_2_RD_PEND | DCURD 2 reads pending
81 | BGL_UPC_PU1_DCURD_3_RD_PEND | DCURD 3 reads pending
82 | BGL_UPC_PU1_DCURD_BLIND_REQ | DCURD BLIND request
83 | BGL_UPC_PU1_DCURD_COHERENCY_STALL_WAR | DCURD coherency stall (WAR)
84 | BGL_UPC_PU1_DCURD_L3_REQ | DCURD L3 request
85 | BGL_UPC_PU1_DCURD_L3_REQ_PEND | DCURD L3 request pending
86 | BGL_UPC_PU1_DCURD_LINK_REQ | DCURD LINK request
87 | BGL_UPC_PU1_DCURD_LINK_REQ_PEND | DCURD LINK request pending
88 | BGL_UPC_PU1_DCURD_LOCK_REQ | DCURD LOCK request
89 | BGL_UPC_PU1_DCURD_LOCK_REQ_PEND | DCURD LOCK request pending
92 | BGL_UPC_PU1_DCURD_RD_REQ | DCURD read request
93 | BGL_UPC_PU1_DCURD_SRAM_REQ | DCURD SRAM request
95 | BGL_UPC_PU1_DCURD_WAIT_L3 | DCURD wait for L3
96 | BGL_UPC_PU1_DCURD_WAIT_LINK | DCURD wait for LINK
97 | BGL_UPC_PU1_DCURD_WAIT_LOCK | DCURD wait for LOCK
100 | BGL_UPC_PU1_PREF_FILTER_HIT | Prefetch filter hit
101 | BGL_UPC_PU1_PREF_PREF_PEND | Prefetch prefetch pending
102 | BGL_UPC_PU1_PREF_REQ_VALID | Prefetch request valid
103 | BGL_UPC_PU1_PREF_SELF_HIT | Prefetch self hit
104 | BGL_UPC_PU1_PREF_SNOOP_HIT_OTHER | Prefetch snoop hit other
105 | BGL_UPC_PU1_PREF_SNOOP_HIT_PLB | Prefetch snoop hit PLB
106 | BGL_UPC_PU1_PREF_SNOOP_HIT_SAME | Prefetch snoop hit same
107 | BGL_UPC_PU1_PREF_STREAM_HIT | Prefetch stream hit


Counter definition ID 3 (continued)

Counter ID | Counter name | Counter description
108 | BGL_UPC_TI_TESTINT_ERR_EVENT | Testint error event
146 | BGL_UPC_TR_ARB_SNDR0_BUSY | Arbiter sender 0 busy
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event

Table C-13 BGLPERFDEF table: Counter definition ID 4

Counter ID | Counter name | Counter description
8 | BGL_2NDFPU_ARITH_ADD_SUBTRACT | Second FPU add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)
12 | BGL_2NDFPU_LDST_DBL_LD | 2nd FPU double loads, lfd, lfdx, lfdu, lfdux, lfsdx, lfsdux (double word loads, no single precision)
255 | BGL_UPC_TS_XM_32B_CHUNKS | XM 32 B chunks
256 | BGL_UPC_TS_XM_ACKS | XM acks
257 | BGL_UPC_TS_XM_LINK_AVAIL_NO_VCBN_TOKENS | XM link available; no vcbn tokens
258 | BGL_UPC_TS_XM_LINK_AVAIL_NO_VCBP_TOKENS | XM link available; no vcbp tokens
259 | BGL_UPC_TS_XM_LINK_AVAIL_NO_VCD0_VCD1_TOKENS | XM link available; no vcd0 vcd1 tokens
260 | BGL_UPC_TS_XM_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS | XM link available; no vcd0 vcd; vcbn tokens
261 | BGL_UPC_TS_XM_PKTS | XM packets
262 | BGL_UPC_TS_XM_TOKEN_ACKS | XM token/acks
263 | BGL_UPC_TS_XM_VCBN_CHUNKS | XM vcbn chunks
264 | BGL_UPC_TS_XM_VCBP_CHUNKS | XM vcbp chunks
265 | BGL_UPC_TS_XM_VCD0_CHUNKS | XM vcd0 chunks
266 | BGL_UPC_TS_XM_VCD1_CHUNKS | XM vcd1 chunks
267 | BGL_UPC_TS_XP_32B_CHUNKS | XP 32 B chunks


Counter definition ID 4 (continued)

Counter ID | Counter name | Counter description
268 | BGL_UPC_TS_XP_ACKS | XP acks
269 | BGL_UPC_TS_XP_LINK_AVAIL_NO_VCBN_TOKENS | XP link available; no vcbn tokens
270 | BGL_UPC_TS_XP_LINK_AVAIL_NO_VCBP_TOKENS | XP link available; no vcbp tokens
271 | BGL_UPC_TS_XP_LINK_AVAIL_NO_VCD0_VCD1_TOKENS | XP link available; no vcd0 vcd1 tokens
272 | BGL_UPC_TS_XP_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS | XP link available; no vcd0 vcd; vcbn tokens
273 | BGL_UPC_TS_XP_PKTS | XP packets
274 | BGL_UPC_TS_XP_TOKEN_ACKS | XP token/acks
275 | BGL_UPC_TS_XP_VCBN_CHUNKS | XP vcbn chunks
276 | BGL_UPC_TS_XP_VCBP_CHUNKS | XP vcbp chunks
277 | BGL_UPC_TS_XP_VCD0_CHUNKS | XP vcd0 chunks
278 | BGL_UPC_TS_XP_VCD1_CHUNKS | XP vcd1 chunks
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event

Table C-14 BGLPERFDEF table: Counter definition ID 5

Counter ID | Counter name | Counter description
1 | BGL_FPU_ARITH_MULT_DIV | Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
5 | BGL_FPU_LDST_DBL_ST | Double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)
9 | BGL_2NDFPU_ARITH_MULT_DIV | Second FPU multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
13 | BGL_2NDFPU_LDST_DBL_ST | Second FPU double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)
137 | BGL_UPC_TR_ARB_CLASS9_WINS | Arbiter class 9 wins
261 | BGL_UPC_TS_XM_PKTS | XM packets


Counter definition ID 5 (continued)

Counter ID | Counter name | Counter description
273 | BGL_UPC_TS_XP_PKTS | XP packets
279 | BGL_UPC_TS_YM_32B_CHUNKS | YM 32 B chunks
280 | BGL_UPC_TS_YM_ACKS | YM acks
281 | BGL_UPC_TS_YM_LINK_AVAIL_NO_VCBN_TOKENS | YM link available; no vcbn tokens
282 | BGL_UPC_TS_YM_LINK_AVAIL_NO_VCBP_TOKENS | YM link available; no vcbp tokens
283 | BGL_UPC_TS_YM_LINK_AVAIL_NO_VCD0_VCD1_TOKENS | YM link available; no vcd0 vcd1 tokens
284 | BGL_UPC_TS_YM_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS | YM link available; no vcd0 vcd; vcbn tokens
285 | BGL_UPC_TS_YM_PKTS | YM packets
286 | BGL_UPC_TS_YM_TOKEN_ACKS | YM token/acks
287 | BGL_UPC_TS_YM_VCBN_CHUNKS | YM vcbn chunks
288 | BGL_UPC_TS_YM_VCBP_CHUNKS | YM vcbp chunks
289 | BGL_UPC_TS_YM_VCD0_CHUNKS | YM vcd0 chunks
290 | BGL_UPC_TS_YM_VCD1_CHUNKS | YM vcd1 chunks
291 | BGL_UPC_TS_YP_32B_CHUNKS | YP 32 B chunks
292 | BGL_UPC_TS_YP_ACKS | YP acks
293 | BGL_UPC_TS_YP_LINK_AVAIL_NO_VCBN_TOKENS | YP link available; no vcbn tokens
294 | BGL_UPC_TS_YP_LINK_AVAIL_NO_VCBP_TOKENS | YP link available; no vcbp tokens
295 | BGL_UPC_TS_YP_LINK_AVAIL_NO_VCD0_VCD1_TOKENS | YP link available; no vcd0 vcd1 tokens
296 | BGL_UPC_TS_YP_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS | YP link available; no vcd0 vcd; vcbn tokens
297 | BGL_UPC_TS_YP_PKTS | YP packets
298 | BGL_UPC_TS_YP_TOKEN_ACKS | YP token/acks
299 | BGL_UPC_TS_YP_VCBN_CHUNKS | YP vcbn chunks
300 | BGL_UPC_TS_YP_VCBP_CHUNKS | YP vcbp chunks
301 | BGL_UPC_TS_YP_VCD0_CHUNKS | YP vcd0 chunks
302 | BGL_UPC_TS_YP_VCD1_CHUNKS | YP vcd1 chunks
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event


Table C-15 BGLPERFDEF table: Counter definition ID 6

Counter ID | Counter name | Counter description
2 | BGL_FPU_ARITH_OEDIPUS_OP | Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
6 | BGL_FPU_LDST_QUAD_LD | Quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
10 | BGL_2NDFPU_ARITH_OEDIPUS_OP | Second FPU Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
14 | BGL_2NDFPU_LDST_QUAD_LD | Second FPU quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
303 | BGL_UPC_TS_ZM_32B_CHUNKS | ZM 32 B chunks
304 | BGL_UPC_TS_ZM_ACKS | ZM acks
305 | BGL_UPC_TS_ZM_LINK_AVAIL_NO_VCBN_TOKENS | ZM link available; no vcbn tokens
306 | BGL_UPC_TS_ZM_LINK_AVAIL_NO_VCBP_TOKENS | ZM link available; no vcbp tokens
307 | BGL_UPC_TS_ZM_LINK_AVAIL_NO_VCD0_VCD1_TOKENS | ZM link available; no vcd0 vcd1 tokens
308 | BGL_UPC_TS_ZM_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS | ZM link available; no vcd0 vcd; vcbn tokens
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
310 | BGL_UPC_TS_ZM_TOKEN_ACKS | ZM token/acks
311 | BGL_UPC_TS_ZM_VCBN_CHUNKS | ZM vcbn chunks
312 | BGL_UPC_TS_ZM_VCBP_CHUNKS | ZM vcbp chunks
313 | BGL_UPC_TS_ZM_VCD0_CHUNKS | ZM vcd0 chunks
314 | BGL_UPC_TS_ZM_VCD1_CHUNKS | ZM vcd1 chunks
315 | BGL_UPC_TS_ZP_32B_CHUNKS | ZP 32 B chunks
316 | BGL_UPC_TS_ZP_ACKS | ZP acks
317 | BGL_UPC_TS_ZP_LINK_AVAIL_NO_VCBN_TOKENS | ZP link available; no vcbn tokens
318 | BGL_UPC_TS_ZP_LINK_AVAIL_NO_VCBP_TOKENS | ZP link available; no vcbp tokens
319 | BGL_UPC_TS_ZP_LINK_AVAIL_NO_VCD0_VCD1_TOKENS | ZP link available; no vcd0 vcd1 tokens


Counter definition ID 6 (continued)

Counter ID | Counter name | Counter description
320 | BGL_UPC_TS_ZP_LINK_AVAIL_NO_VCD0_VCD_VCBN_TOKENS | ZP link available; no vcd0 vcd; vcbn tokens
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
322 | BGL_UPC_TS_ZP_TOKEN_ACKS | ZP token/acks
323 | BGL_UPC_TS_ZP_VCBN_CHUNKS | ZP vcbn chunks
324 | BGL_UPC_TS_ZP_VCBP_CHUNKS | ZP vcbp chunks
325 | BGL_UPC_TS_ZP_VCD0_CHUNKS | ZP vcd0 chunks
326 | BGL_UPC_TS_ZP_VCD1_CHUNKS | ZP vcd1 chunks
327 | BGL_PERFCTR_NULL_EVENT | Null event

Table C-16 BGLPERFDEF table: Counter definition ID 7

Counter ID | Counter name | Counter description
3 | BGL_FPU_ARITH_TRINARY_OP | Trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
7 | BGL_FPU_LDST_QUAD_ST | Quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
11 | BGL_2NDFPU_ARITH_TRINARY_OP | Second FPU trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
15 | BGL_2NDFPU_LDST_QUAD_ST | Second FPU quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
109 | BGL_UPC_TR_ARB_CH2_VC0_HAVE | Arbiter ch2_vc0_have
110 | BGL_UPC_TR_ARB_CH1_VC0_HAVE | Arbiter ch1_vc0_have
111 | BGL_UPC_TR_ARB_CH0_VC0_HAVE | Arbiter ch0_vc0_have
113 | BGL_UPC_TR_ARB_CH2_VC1_HAVE | Arbiter ch2_vc1_have
114 | BGL_UPC_TR_ARB_CH1_VC1_HAVE | Arbiter ch1_vc1_have
129 | BGL_UPC_TR_ARB_CLASS1_WINS | Arbiter class 1 wins
131 | BGL_UPC_TR_ARB_CLASS3_WINS | Arbiter class 3 wins
132 | BGL_UPC_TR_ARB_CLASS4_WINS | Arbiter class 4 wins
135 | BGL_UPC_TR_ARB_CLASS7_WINS | Arbiter class 7 wins
136 | BGL_UPC_TR_ARB_CLASS8_WINS | Arbiter class 8 wins
138 | BGL_UPC_TR_ARB_CLASS10_WINS | Arbiter class 10 wins
139 | BGL_UPC_TR_ARB_CLASS11_WINS | Arbiter class 11 wins


Counter definition ID 7 (continued)

Counter ID | Counter name | Counter description
140 | BGL_UPC_TR_ARB_CLASS12_WINS | Arbiter class 12 wins
141 | BGL_UPC_TR_ARB_CLASS13_WINS | Arbiter class 13 wins
142 | BGL_UPC_TR_ARB_CLASS14_WINS | Arbiter class 14 wins
143 | BGL_UPC_TR_ARB_CLASS15_WINS | Arbiter class 15 wins
144 | BGL_UPC_TR_ARB_SNDR2_BUSY | Arbiter sender 2 busy
145 | BGL_UPC_TR_ARB_SNDR1_BUSY | Arbiter sender 1 busy
148 | BGL_UPC_TR_ARB_RCV2_BUSY | Arbiter receiver 2 busy
149 | BGL_UPC_TR_ARB_RCV1_BUSY | Arbiter receiver 1 busy
150 | BGL_UPC_TR_ARB_RCV0_BUSY | Arbiter receiver 0 busy
152 | BGL_UPC_TR_ARB_ALU_BUSY | Arbiter ALU busy
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event

Table C-17 BGLPERFDEF table: Counter definition ID 8

Counter ID | Counter name | Counter description
0 | BGL_FPU_ARITH_ADD_SUBTRACT | Add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)
4 | BGL_FPU_LDST_DBL_LD | Double loads, lfd, lfdx, lfdu, lfdux, lfsdx, lfsdux (double word loads, no single precision)
8 | BGL_2NDFPU_ARITH_ADD_SUBTRACT | Second FPU add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)
12 | BGL_2NDFPU_LDST_DBL_LD | Second FPU double loads, lfd, lfdx, lfdu, lfdux, lfsdx, lfsdux (double word loads, no single precision)
239 | BGL_UPC_TR_INJ_VC0_HDR_ADDED | Injection vc0 header added
240 | BGL_UPC_TR_INJ_VC1_HDR_ADDED | Injection vc1 header added
241 | BGL_UPC_TR_INJ_VC0_PYLD_ADDED | Injection vc0 payload added


Counter definition ID 8 (continued)

Counter ID | Counter name | Counter description
242 | BGL_UPC_TR_INJ_VC1_PYLD_ADDED | Injection vc1 payload added
243 | BGL_UPC_TR_INJ_VC0_PKT_TKN | Injection vc0 packet taken
244 | BGL_UPC_TR_INJ_VC1_PKT_TKN | Injection vc1 packet taken
245 | BGL_UPC_TR_INJ_SRAM_ERR_CORR | Injection SRAM error corrected
246 | BGL_UPC_TR_REC_VC0_PKT_ADDED | Reception vc0 packet added
247 | BGL_UPC_TR_REC_VC1_PKT_ADDED | Reception vc1 packet added
248 | BGL_UPC_TR_REC_VC0_HDR_TKN | Reception vc0 header taken
249 | BGL_UPC_TR_REC_VC1_HDR_TKN | Reception vc1 header taken
250 | BGL_UPC_TR_REC_VC0_PYLD_TKN | Reception vc0 payload taken
251 | BGL_UPC_TR_REC_VC1_PYLD_TKN | Reception vc1 payload taken
252 | BGL_UPC_TR_REC_VC0_PKT_DISC | Reception vc0 packet discarded
253 | BGL_UPC_TR_REC_VC1_PKT_DISC | Reception vc1 packet discarded
254 | BGL_UPC_TR_REC_SRAM_ERR_CORR | Reception SRAM error corrected
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event


Table C-18 BGLPERFDEF table: Counter definition ID 9

Counter ID | Counter name | Counter description
1 | BGL_FPU_ARITH_MULT_DIV | Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
5 | BGL_FPU_LDST_DBL_ST | Double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)
9 | BGL_2NDFPU_ARITH_MULT_DIV | Second FPU multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
13 | BGL_2NDFPU_LDST_DBL_ST | Second FPU double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)
161 | BGL_UPC_TR_RCV_0_VC0_DPKT_RCV | Receiver 0 vc0 data packet received
162 | BGL_UPC_TR_RCV_0_VC1_DPKT_RCV | Receiver 0 vc1 data packet received
163 | BGL_UPC_TR_RCV_0_VC0_EMPTY_PKT | Receiver 0 vc0 empty packet
164 | BGL_UPC_TR_RCV_0_VC1_EMPTY_PKT | Receiver 0 vc1 empty packet
165 | BGL_UPC_TR_RCV_0_IDLPKT | Receiver 0 IDLE packet
166 | BGL_UPC_TR_RCV_0_KNOWN_BAD_PKT_MARKER | Receiver 0 known-bad-packet marker
167 | BGL_UPC_TR_RCV_0_VC0_CUT_THROUGH | Receiver 0 vc0 cut-through
169 | BGL_UPC_TR_RCV_0_VC0_FULL | Receiver 0 vc0 full
170 | BGL_UPC_TR_RCV_0_VC1_FULL | Receiver 0 vc1 full
171 | BGL_UPC_TR_RCV_0_HDR_PARITY_ERR | Receiver 0 header parity error
172 | BGL_UPC_TR_RCV_0_CRC_ERR | Receiver 0 CRC error
173 | BGL_UPC_TR_RCV_0_UNEXPCT_HDR_ERR | Receiver 0 unexpected header error
175 | BGL_UPC_TR_RCV_0_SRAM_ERR_CORR | Receiver 0 SRAM error corrected
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event


Table C-19 BGLPERFDEF table: Counter definition ID 10

Counter ID | Counter name | Counter description
2 | BGL_FPU_ARITH_OEDIPUS_OP | Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
6 | BGL_FPU_LDST_QUAD_LD | Quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
10 | BGL_2NDFPU_ARITH_OEDIPUS_OP | Second FPU Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
14 | BGL_2NDFPU_LDST_QUAD_LD | Second FPU quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
176 | BGL_UPC_TR_RCV_1_VC0_DPKT_RCV | Receiver 1 vc0 data packet received
177 | BGL_UPC_TR_RCV_1_VC1_DPKT_RCV | Receiver 1 vc1 data packet received
178 | BGL_UPC_TR_RCV_1_VC0_EMPTY_PKT | Receiver 1 vc0 empty packet
179 | BGL_UPC_TR_RCV_1_VC1_EMPTY_PKT | Receiver 1 vc1 empty packet
180 | BGL_UPC_TR_RCV_1_IDLPKT | Receiver 1 IDLE packet
181 | BGL_UPC_TR_RCV_1_KNOWN_BAD_PKT_MARKER | Receiver 1 known-bad-packet marker
182 | BGL_UPC_TR_RCV_1_VC0_CUT_THROUGH | Receiver 1 vc0 cut-through
183 | BGL_UPC_TR_RCV_1_VC1_CUT_THROUGH | Receiver 1 vc1 cut-through
184 | BGL_UPC_TR_RCV_1_VC0_FULL | Receiver 1 vc0 full
185 | BGL_UPC_TR_RCV_1_VC1_FULL | Receiver 1 vc1 full
186 | BGL_UPC_TR_RCV_1_HDR_PARITY_ERR | Receiver 1 header parity error
187 | BGL_UPC_TR_RCV_1_CRC_ERR | Receiver 1 CRC error
188 | BGL_UPC_TR_RCV_1_UNEXPCT_HDR_ERR | Receiver 1 unexpected header error
190 | BGL_UPC_TR_RCV_1_SRAM_ERR_CORR | Receiver 1 SRAM error corrected
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event


Table C-20 BGLPERFDEF table: Counter definition ID 11

Counter ID | Counter name | Counter description
3 | BGL_FPU_ARITH_TRINARY_OP | Trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
7 | BGL_FPU_LDST_QUAD_ST | Quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
11 | BGL_2NDFPU_ARITH_TRINARY_OP | Second FPU trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
15 | BGL_2NDFPU_LDST_QUAD_ST | Second FPU quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
191 | BGL_UPC_TR_RCV_2_VC0_DPKT_RCV | Receiver 2 vc0 data packet received
192 | BGL_UPC_TR_RCV_2_VC1_DPKT_RCV | Receiver 2 vc1 data packet received
193 | BGL_UPC_TR_RCV_2_VC0_EMPTY_PKT | Receiver 2 vc0 empty packet
194 | BGL_UPC_TR_RCV_2_VC1_EMPTY_PKT | Receiver 2 vc1 empty packet
195 | BGL_UPC_TR_RCV_2_IDLPKT | Receiver 2 IDLE packet
196 | BGL_UPC_TR_RCV_2_KNOWN_BAD_PKT_MARKER | Receiver 2 known-bad-packet marker
197 | BGL_UPC_TR_RCV_2_VC0_CUT_THROUGH | Receiver 2 vc0 cut-through
198 | BGL_UPC_TR_RCV_2_VC1_CUT_THROUGH | Receiver 2 vc1 cut-through
199 | BGL_UPC_TR_RCV_2_VC0_FULL | Receiver 2 vc0 full
200 | BGL_UPC_TR_RCV_2_VC1_FULL | Receiver 2 vc1 full
201 | BGL_UPC_TR_RCV_2_HDR_PARITY_ERR | Receiver 2 header parity error
202 | BGL_UPC_TR_RCV_2_CRC_ERR | Receiver 2 CRC error
203 | BGL_UPC_TR_RCV_2_UNEXPCT_HDR_ERR | Receiver 2 unexpected header error
205 | BGL_UPC_TR_RCV_2_SRAM_ERR_CORR | Receiver 2 SRAM error corrected
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event


Table C-21 BGLPERFDEF table: Counter definition ID 12

Counter ID | Counter name | Counter description
0 | BGL_FPU_ARITH_ADD_SUBTRACT | Add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)
4 | BGL_FPU_LDST_DBL_LD | Double loads, lfd, lfdx, lfdu, lfdux, lfsdx, lfsdux (double word loads, no single precision)
8 | BGL_2NDFPU_ARITH_ADD_SUBTRACT | Second FPU add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)
12 | BGL_2NDFPU_LDST_DBL_LD | Second FPU double loads, lfd, lfdx, lfdu, lfdux, lfsdx, lfsdux (double word loads, no single precision)
228 | BGL_UPC_TR_SNDR_2_VC0_EMPTY | Sender 2 vc0 empty
229 | BGL_UPC_TR_SNDR_2_VC1_EMPTY | Sender 2 vc1 empty
230 | BGL_UPC_TR_SNDR_2_VC0_CUT_THROUGH | Sender 2 vc0 cut-through
231 | BGL_UPC_TR_SNDR_2_VC1_CUT_THROUGH | Sender 2 vc1 cut-through
232 | BGL_UPC_TR_SNDR_2_VC0_PKT_SENT_TOTAL | Sender 2 vc0 packet sent (total)
233 | BGL_UPC_TR_SNDR_2_VC1_PKT_SENT_TOTAL | Sender 2 vc1 packet sent (total)
234 | BGL_UPC_TR_SNDR_2_VC0_DPKTS_SENT | Sender 2 vc0 DATA packets sent
235 | BGL_UPC_TR_SNDR_2_VC1_DPKTS_SENT | Sender 2 vc1 DATA packets sent
236 | BGL_UPC_TR_SNDR_2_IDLPKTS_SENT | Sender 2 IDLE packets sent
237 | BGL_UPC_TR_SNDR_2_RESEND_ATTS | Sender 2 resend attempts
238 | BGL_UPC_TR_SNDR_2_SRAM_ERR_CORR | Sender 2 SRAM error corrected
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event


Table C-22 BGLPERFDEF table: Counter definition ID 13

Counter ID | Counter name | Counter description
1 | BGL_FPU_ARITH_MULT_DIV | Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
5 | BGL_FPU_LDST_DBL_ST | Double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)
9 | BGL_2NDFPU_ARITH_MULT_DIV | Second FPU multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
13 | BGL_2NDFPU_LDST_DBL_ST | Second FPU double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)
217 | BGL_UPC_TR_SNDR_1_VC0_EMPTY | Sender 1 vc0 empty
218 | BGL_UPC_TR_SNDR_1_VC1_EMPTY | Sender 1 vc1 empty
219 | BGL_UPC_TR_SNDR_1_VC0_CUT_THROUGH | Sender 1 vc0 cut-through
220 | BGL_UPC_TR_SNDR_1_VC1_CUT_THROUGH | Sender 1 vc1 cut-through
221 | BGL_UPC_TR_SNDR_1_VC0_PKT_SENT_TOTAL | Sender 1 vc0 packet sent (total)
222 | BGL_UPC_TR_SNDR_1_VC1_PKT_SENT_TOTAL | Sender 1 vc1 packet sent (total)
223 | BGL_UPC_TR_SNDR_1_VC0_DPKTS_SENT | Sender 1 vc0 DATA packets sent
224 | BGL_UPC_TR_SNDR_1_VC1_DPKTS_SENT | Sender 1 vc1 DATA packets sent
225 | BGL_UPC_TR_SNDR_1_IDLPKTS_SENT | Sender 1 IDLE packets sent
226 | BGL_UPC_TR_SNDR_1_RESEND_ATTS | Sender 1 resend attempts
227 | BGL_UPC_TR_SNDR_1_SRAM_ERR_CORR | Sender 1 SRAM error corrected
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event


Table C-23 BGLPERFDEF table: Counter definition ID 14

Counter ID | Counter name | Counter description
2 | BGL_FPU_ARITH_OEDIPUS_OP | Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
6 | BGL_FPU_LDST_QUAD_LD | Quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
10 | BGL_2NDFPU_ARITH_OEDIPUS_OP | Second FPU Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
14 | BGL_2NDFPU_LDST_QUAD_LD | Second FPU quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
206 | BGL_UPC_TR_SNDR_0_VC0_EMPTY | Sender 0 vc0 empty
207 | BGL_UPC_TR_SNDR_0_VC1_EMPTY | Sender 0 vc1 empty
208 | BGL_UPC_TR_SNDR_0_VC0_CUT_THROUGH | Sender 0 vc0 cut-through
209 | BGL_UPC_TR_SNDR_0_VC1_CUT_THROUGH | Sender 0 vc1 cut-through
210 | BGL_UPC_TR_SNDR_0_VC0_PKT_SENT_TOTAL | Sender 0 vc0 packet sent (total)
211 | BGL_UPC_TR_SNDR_0_VC1_PKT_SENT_TOTAL | Sender 0 vc1 packet sent (total)
212 | BGL_UPC_TR_SNDR_0_VC0_DPKTS_SENT | Sender 0 vc0 DATA packets sent
213 | BGL_UPC_TR_SNDR_0_VC1_DPKTS_SENT | Sender 0 vc1 DATA packets sent
214 | BGL_UPC_TR_SNDR_0_IDLPKTS_SENT | Sender 0 IDLE packets sent
215 | BGL_UPC_TR_SNDR_0_RESEND_ATTS | Sender 0 resend attempts
216 | BGL_UPC_TR_SNDR_0_SRAM_ERR_CORR | Sender 0 SRAM error corrected
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event


Table C-24 BGLPERFDEF table: Counter definition ID 15

Counter ID | Counter name | Counter description
3 | BGL_FPU_ARITH_TRINARY_OP | Trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
7 | BGL_FPU_LDST_QUAD_ST | Quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
11 | BGL_2NDFPU_ARITH_TRINARY_OP | Second FPU trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
15 | BGL_2NDFPU_LDST_QUAD_ST | Second FPU quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
115 | BGL_UPC_TR_ARB_CH0_VC1_HAVE | Arbiter ch0_vc1_have
116 | BGL_UPC_TR_ARB_INJ_VC1_HAVE | Arbiter inj_vc1_have
118 | BGL_UPC_TR_ARB_CORE_CH1_VC0_MATURE | Arbiter_core ch1_vc0_mature
119 | BGL_UPC_TR_ARB_CORE_CH0_VC0_MATURE | Arbiter_core ch0_vc0_mature
120 | BGL_UPC_TR_ARB_CORE_INJ_VC0_MATURE | Arbiter_core inj_vc0_mature
121 | BGL_UPC_TR_ARB_CORE_CH2_VC1_MATURE | Arbiter_core ch2_vc1_mature
122 | BGL_UPC_TR_ARB_CORE_CH1_VC1_MATURE | Arbiter_core ch1_vc1_mature
123 | BGL_UPC_TR_ARB_CORE_CH0_VC1_MATURE | Arbiter_core ch0_vc1_mature
124 | BGL_UPC_TR_ARB_CORE_INJ_VC1_MATURE | Arbiter_core inj_vc1_mature
125 | BGL_UPC_TR_ARB_CORE_GREEDY_MODE | Arbiter_core greedy_mode
126 | BGL_UPC_TR_ARB_CORE_REQ_PEND | Arbiter_core requests pending
127 | BGL_UPC_TR_ARB_CORE_REQ_WAITING_RDY_GO | Arbiter_core requests waiting (ready to go)
128 | BGL_UPC_TR_ARB_CLASS0_WINS | Arbiter class 0 wins
130 | BGL_UPC_TR_ARB_CLASS2_WINS | Arbiter class 2 wins
133 | BGL_UPC_TR_ARB_CLASS5_WINS | Arbiter class 5 wins
134 | BGL_UPC_TR_ARB_CLASS6_WINS | Arbiter class 6 wins
147 | BGL_UPC_TR_ARB_LOCAL_CLIENT_BUSY_REC | Arbiter local client busy (reception)
151 | BGL_UPC_TR_ARB_LOCAL_CLIENT_BUSY_INJ | Arbiter local client busy (injection)
153 | BGL_UPC_TR_ARB_RCV2_ABORT | Arbiter receiver 2 abort
154 | BGL_UPC_TR_ARB_RCV1_ABORT | Arbiter receiver 1 abort
155 | BGL_UPC_TR_ARB_RCV0_ABORT | Arbiter receiver 0 abort
156 | BGL_UPC_TR_ARB_LOCAL_CLIENT_ABORT | Arbiter local client abort
157 | BGL_UPC_TR_ARB_RCV2_PKT_TKN | Arbiter receiver 2 packet taken


Counter definition ID 15 (continued)

Counter ID | Counter name | Counter description
158 | BGL_UPC_TR_ARB_RCV1_PKT_TKN | Arbiter receiver 1 packet taken
159 | BGL_UPC_TR_ARB_RCV0_PKT_TKN | Arbiter receiver 0 packet taken
160 | BGL_UPC_TR_ARB_LOCAL_CLIENT_PKT_TKN | Arbiter local client packet taken
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event

Table C-25 BGLPERFDEF table: Counter definition ID 16

Counter ID | Counter name | Counter description
10000 | BGL_FPU_DERIVED_COUNTER_1 | Round robins FPU counters 0, 4, 8, 12
10001 | BGL_FPU_DERIVED_COUNTER_2 | Round robins FPU counters 1, 5, 9, 13
10002 | BGL_FPU_DERIVED_COUNTER_3 | Round robins FPU counters 2, 6, 10, 14
10003 | BGL_FPU_DERIVED_COUNTER_4 | Round robins FPU counters 3, 7, 11, 15
16 | BGL_UPC_L3_CACHE_HIT | Cache hit
17 | BGL_UPC_L3_CACHE_MISS_DATA_ALRDY_WAY_DDR | Cache miss; data already on the way from DDR
18 | BGL_UPC_L3_CACHE_MISS_DATA_WILL_BE_REQED_DDR | Cache miss; data will be requested from DDR
19 | BGL_UPC_L3_EDRAM_ACCESS_CYCLE | EDRAM access cycle
20 | BGL_UPC_L3_EDRAM_RFR_CYCLE | EDRAM refresh cycle
21 | BGL_UPC_L3_LINE_STARTS_EVICT_LINE_NUM_PRESSURE | Line starts to evict due to line number pressure
22 | BGL_UPC_L3_MISS_DIR_SET_DISBL | Miss, but this directory set is disabled
23 | BGL_UPC_L3_MISS_NO_WAY_SET_AVAIL | Miss and no way in this set is available
24 | BGL_UPC_L3_MISS_REQUIRING_CASTOUT | Miss requiring a castout
25 | BGL_UPC_L3_MISS_REQUIRING_REFILL_NO_WR_ALLOC | Miss requiring a refill (no write allocation)
26 | BGL_UPC_L3_MSHNDLR_TOOK_REQ | Miss handler took request


Counter definition ID 16 (continued)

Counter ID | Counter name | Counter description
27 | BGL_UPC_L3_MSHNDLR_TOOK_REQ_PLB_RDQ | Miss handler took request from PLB read queue
28 | BGL_UPC_L3_MSHNDLR_TOOK_REQ_RDQ0 | Miss handler took request from read queue 0
29 | BGL_UPC_L3_MSHNDLR_TOOK_REQ_RDQ1 | Miss handler took request from read queue 1
30 | BGL_UPC_L3_MSHNDLR_TOOK_REQ_WRBUF | Miss handler took request from write buffer
31 | BGL_UPC_L3_PAGE_CLOSE | Page close occurred
32 | BGL_UPC_L3_PAGE_OPEN | Page open occurred
33 | BGL_UPC_L3_PLB_WRQ_DEP_DBUF | PLB write queue deposits data into buffer
34 | BGL_UPC_L3_PLB_WRQ_DEP_DBUF_HIT | PLB write queue deposits data into buffer (hit)
35 | BGL_UPC_L3_PREF_REINS_PULL_OUT_NEXT_LINE | Prefetch reinserted to pull out next line
36 | BGL_UPC_L3_PREF_REQ_ACC_BY_PREF_UNIT | Prefetch request accepted by prefetch unit
37 | BGL_UPC_L3_RD_BURST_1024B_LINE_RD | Read burst (1024B line read) occurred
38 | BGL_UPC_L3_RD_EDR__ALL_KINDS_OF_RD | Read from EDR occurred (all kinds of read)
39 | BGL_UPC_L3_RD_MODIFY_WR_CYCLE_EDR | Read-modify-write cycle to EDR occurred
40 | BGL_UPC_L3_REQ_TKN_CACHE_INHIB_RD_REQ | Request taken is a cache inhibited read request
41 | BGL_UPC_L3_REQ_TKN_CACHE_INHIB_WR | Request taken is a cache inhibited write
42 | BGL_UPC_L3_REQ_TKN_NEEDS_CASTOUT | Request taken needs castout
43 | BGL_UPC_L3_REQ_TKN_NEEDS_REFILL | Request taken needs refill
261 | BGL_UPC_TS_XM_PKTS | XM packets
273 | BGL_UPC_TS_XP_PKTS | XP packets
285 | BGL_UPC_TS_YM_PKTS | YM packets
297 | BGL_UPC_TS_YP_PKTS | YP packets
309 | BGL_UPC_TS_ZM_PKTS | ZM packets
321 | BGL_UPC_TS_ZP_PKTS | ZP packets
327 | BGL_PERFCTR_NULL_EVENT | Null event
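Definition ID 16 is the only set that includes the derived counters 10000 through 10003, each of which round robins across a group of four hardware FPU counters (for example, counter 10000 covers counters 0, 4, 8, and 12). One simple way to report a single figure for such a group is to combine the four raw counts, as in the hedged sketch below; the grouping follows the table above, but the summing step and the function name derived_value are illustrative assumptions, not the tool's documented derivation.

```python
# Groups taken from the Table C-25 descriptions; combination method assumed.
DERIVED_GROUPS = {
    10000: (0, 4, 8, 12),    # add/subtract and double-load counters
    10001: (1, 5, 9, 13),    # multiply/divide and double-store counters
    10002: (2, 6, 10, 14),   # Oedipus-op and quad-load counters
    10003: (3, 7, 11, 15),   # trinary-op and quad-store counters
}

def derived_value(raw_counts, derived_id):
    """Combine the round-robined raw counters behind one derived counter.

    raw_counts maps counter ID -> count; missing counters contribute 0.
    """
    return sum(raw_counts.get(cid, 0) for cid in DERIVED_GROUPS[derived_id])

# Example raw readings (counter ID -> count); values invented for illustration.
raw = {0: 120, 4: 30, 8: 115, 12: 28}
print(derived_value(raw, 10000))  # 293
```

Because the hardware multiplexes these counters round robin, any such combined figure is an estimate rather than an exact event count.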


Blue Gene/L: Performance Analysis Tools

Table C-26  BGLPERFDEF table: Counter definition ID 1000

Counter ID  Counter name                                      Counter description
0           BGL_FPU_ARITH_ADD_SUBTRACT                        Add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)
4           BGL_FPU_LDST_DBL_LD                               Double loads, lfd, lfdx, lfdu, lfdux, lfsdx, lfsdux (double word loads, no single precision)
8           BGL_2NDFPU_ARITH_ADD_SUBTRACT                     Second FPU add and subtract, fadd, fadds, fsub, fsubs (Book E add, subtract)
12          BGL_2NDFPU_LDST_DBL_LD                            Second FPU double loads, lfd, lfdx, lfdu, lfdux, lfsdx, lfsdux (double word loads, no single precision)
16          BGL_UPC_L3_CACHE_HIT                              Cache hit
18          BGL_UPC_L3_CACHE_MISS_DATA_WILL_BE_REQED_DDR      Cache miss; data will be requested from DDR
53          BGL_UPC_PU0_DCURD_BLIND_REQ                       DCURD BLIND request
55          BGL_UPC_PU0_DCURD_L3_REQ                          DCURD L3 request
64          BGL_UPC_PU0_DCURD_SRAM_REQ                        DCURD SRAM request
73          BGL_UPC_PU0_PREF_REQ_VALID                        Prefetch request valid
74          BGL_UPC_PU0_PREF_SELF_HIT                         Prefetch self hit
82          BGL_UPC_PU1_DCURD_BLIND_REQ                       DCURD BLIND request
84          BGL_UPC_PU1_DCURD_L3_REQ                          DCURD L3 request
93          BGL_UPC_PU1_DCURD_SRAM_REQ                        DCURD SRAM request
102         BGL_UPC_PU1_PREF_REQ_VALID                        Prefetch request valid
103         BGL_UPC_PU1_PREF_SELF_HIT                         Prefetch self hit
241         BGL_UPC_TR_INJ_VC0_PYLD_ADDED                     Injection vc0 payload added
242         BGL_UPC_TR_INJ_VC1_PYLD_ADDED                     Injection vc1 payload added
250         BGL_UPC_TR_REC_VC0_PYLD_TKN                       Reception vc0 payload taken
251         BGL_UPC_TR_REC_VC1_PYLD_TKN                       Reception vc1 payload taken
261         BGL_UPC_TS_XM_PKTS                                XM packets
273         BGL_UPC_TS_XP_PKTS                                XP packets
285         BGL_UPC_TS_YM_PKTS                                YM packets
297         BGL_UPC_TS_YP_PKTS                                YP packets
309         BGL_UPC_TS_ZM_PKTS                                ZM packets
321         BGL_UPC_TS_ZP_PKTS                                ZP packets


Table C-27  BGLPERFDEF table: Counter definition ID 1001

Counter ID  Counter name                                      Counter description
1           BGL_FPU_ARITH_MULT_DIV                            Multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
5           BGL_FPU_LDST_DBL_ST                               Double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)
9           BGL_2NDFPU_ARITH_MULT_DIV                         Second FPU multiplication and divisions, fmul, fmuls, fdiv, fdivs (Book E mul, div)
13          BGL_2NDFPU_LDST_DBL_ST                            Second FPU double store, stfd, stfdx, stfdu, stfdux, stfsdx, stfsdux (double word stores, no single precision)
16          BGL_UPC_L3_CACHE_HIT                              Cache hit
18          BGL_UPC_L3_CACHE_MISS_DATA_WILL_BE_REQED_DDR      Cache miss; data will be requested from DDR
53          BGL_UPC_PU0_DCURD_BLIND_REQ                       DCURD BLIND request
55          BGL_UPC_PU0_DCURD_L3_REQ                          DCURD L3 request
64          BGL_UPC_PU0_DCURD_SRAM_REQ                        DCURD SRAM request
73          BGL_UPC_PU0_PREF_REQ_VALID                        Prefetch request valid
74          BGL_UPC_PU0_PREF_SELF_HIT                         Prefetch self hit
82          BGL_UPC_PU1_DCURD_BLIND_REQ                       DCURD BLIND request
84          BGL_UPC_PU1_DCURD_L3_REQ                          DCURD L3 request
93          BGL_UPC_PU1_DCURD_SRAM_REQ                        DCURD SRAM request
102         BGL_UPC_PU1_PREF_REQ_VALID                        Prefetch request valid
103         BGL_UPC_PU1_PREF_SELF_HIT                         Prefetch self hit
241         BGL_UPC_TR_INJ_VC0_PYLD_ADDED                     Injection vc0 payload added
242         BGL_UPC_TR_INJ_VC1_PYLD_ADDED                     Injection vc1 payload added
250         BGL_UPC_TR_REC_VC0_PYLD_TKN                       Reception vc0 payload taken
251         BGL_UPC_TR_REC_VC1_PYLD_TKN                       Reception vc1 payload taken
261         BGL_UPC_TS_XM_PKTS                                XM packets
273         BGL_UPC_TS_XP_PKTS                                XP packets
285         BGL_UPC_TS_YM_PKTS                                YM packets
297         BGL_UPC_TS_YP_PKTS                                YP packets
309         BGL_UPC_TS_ZM_PKTS                                ZM packets
321         BGL_UPC_TS_ZP_PKTS                                ZP packets


Table C-28  BGLPERFDEF table: Counter definition ID 1002

Counter ID  Counter name                                      Counter description
2           BGL_FPU_ARITH_OEDIPUS_OP                          Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
6           BGL_FPU_LDST_QUAD_LD                              Quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
10          BGL_2NDFPU_ARITH_OEDIPUS_OP                       Second FPU Oedipus operations, all symmetric, asymmetric, and complex Oedipus multiply-add instructions
14          BGL_2NDFPU_LDST_QUAD_LD                           Second FPU quad loads, lfpdx, lfpdux, lfxdx, lfxdux (quad word loads)
16          BGL_UPC_L3_CACHE_HIT                              Cache hit
18          BGL_UPC_L3_CACHE_MISS_DATA_WILL_BE_REQED_DDR      Cache miss; data will be requested from DDR
53          BGL_UPC_PU0_DCURD_BLIND_REQ                       DCURD BLIND request
55          BGL_UPC_PU0_DCURD_L3_REQ                          DCURD L3 request
64          BGL_UPC_PU0_DCURD_SRAM_REQ                        DCURD SRAM request
73          BGL_UPC_PU0_PREF_REQ_VALID                        Prefetch request valid
74          BGL_UPC_PU0_PREF_SELF_HIT                         Prefetch self hit
82          BGL_UPC_PU1_DCURD_BLIND_REQ                       DCURD BLIND request
84          BGL_UPC_PU1_DCURD_L3_REQ                          DCURD L3 request
93          BGL_UPC_PU1_DCURD_SRAM_REQ                        DCURD SRAM request
102         BGL_UPC_PU1_PREF_REQ_VALID                        Prefetch request valid
103         BGL_UPC_PU1_PREF_SELF_HIT                         Prefetch self hit
241         BGL_UPC_TR_INJ_VC0_PYLD_ADDED                     Injection vc0 payload added
242         BGL_UPC_TR_INJ_VC1_PYLD_ADDED                     Injection vc1 payload added
250         BGL_UPC_TR_REC_VC0_PYLD_TKN                       Reception vc0 payload taken
251         BGL_UPC_TR_REC_VC1_PYLD_TKN                       Reception vc1 payload taken
261         BGL_UPC_TS_XM_PKTS                                XM packets
273         BGL_UPC_TS_XP_PKTS                                XP packets
285         BGL_UPC_TS_YM_PKTS                                YM packets
297         BGL_UPC_TS_YP_PKTS                                YP packets
309         BGL_UPC_TS_ZM_PKTS                                ZM packets
321         BGL_UPC_TS_ZP_PKTS                                ZP packets


Table C-29  BGLPERFDEF table: Counter definition ID 1003

Counter ID  Counter name                                      Counter description
3           BGL_FPU_ARITH_TRINARY_OP                          Trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
7           BGL_FPU_LDST_QUAD_ST                              Quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
11          BGL_2NDFPU_ARITH_TRINARY_OP                       Second FPU trinary operations, fmadd, fmadds, fnmadd, fnmadds, fmsub, fmsubs, fnmsub, fnmsubs (Book E fmadd)
15          BGL_2NDFPU_LDST_QUAD_ST                           Second FPU quad store, stfpdx, stfpdux, stfxdx, stfxdux (quad word stores)
16          BGL_UPC_L3_CACHE_HIT                              Cache hit
18          BGL_UPC_L3_CACHE_MISS_DATA_WILL_BE_REQED_DDR      Cache miss; data will be requested from DDR
53          BGL_UPC_PU0_DCURD_BLIND_REQ                       DCURD BLIND request
55          BGL_UPC_PU0_DCURD_L3_REQ                          DCURD L3 request
64          BGL_UPC_PU0_DCURD_SRAM_REQ                        DCURD SRAM request
73          BGL_UPC_PU0_PREF_REQ_VALID                        Prefetch request valid
74          BGL_UPC_PU0_PREF_SELF_HIT                         Prefetch self hit
82          BGL_UPC_PU1_DCURD_BLIND_REQ                       DCURD BLIND request
84          BGL_UPC_PU1_DCURD_L3_REQ                          DCURD L3 request
93          BGL_UPC_PU1_DCURD_SRAM_REQ                        DCURD SRAM request
102         BGL_UPC_PU1_PREF_REQ_VALID                        Prefetch request valid
103         BGL_UPC_PU1_PREF_SELF_HIT                         Prefetch self hit
241         BGL_UPC_TR_INJ_VC0_PYLD_ADDED                     Injection vc0 payload added
242         BGL_UPC_TR_INJ_VC1_PYLD_ADDED                     Injection vc1 payload added
250         BGL_UPC_TR_REC_VC0_PYLD_TKN                       Reception vc0 payload taken
251         BGL_UPC_TR_REC_VC1_PYLD_TKN                       Reception vc1 payload taken
261         BGL_UPC_TS_XM_PKTS                                XM packets
273         BGL_UPC_TS_XP_PKTS                                XP packets
285         BGL_UPC_TS_YM_PKTS                                YM packets
297         BGL_UPC_TS_YP_PKTS                                YP packets
309         BGL_UPC_TS_ZM_PKTS                                ZM packets
321         BGL_UPC_TS_ZP_PKTS                                ZP packets


Blue Gene/L: Performance Analysis Tools

Table C-30  BGLPERFDEF table: Counter definition ID 1004

Counter ID  Counter name                                      Counter description
10000       BGL_FPU_DERIVED_COUNTER_1                         Round robins FPU counters 0, 4, 8, 12
10001       BGL_FPU_DERIVED_COUNTER_2                         Round robins FPU counters 1, 5, 9, 13
10002       BGL_FPU_DERIVED_COUNTER_3                         Round robins FPU counters 2, 6, 10, 14
10003       BGL_FPU_DERIVED_COUNTER_4                         Round robins FPU counters 3, 7, 11, 15
16          BGL_UPC_L3_CACHE_HIT                              Cache hit
18          BGL_UPC_L3_CACHE_MISS_DATA_WILL_BE_REQED_DDR      Cache miss; data will be requested from DDR
53          BGL_UPC_PU0_DCURD_BLIND_REQ                       DCURD BLIND request
55          BGL_UPC_PU0_DCURD_L3_REQ                          DCURD L3 request
64          BGL_UPC_PU0_DCURD_SRAM_REQ                        DCURD SRAM request
73          BGL_UPC_PU0_PREF_REQ_VALID                        Prefetch request valid
74          BGL_UPC_PU0_PREF_SELF_HIT                         Prefetch self hit
82          BGL_UPC_PU1_DCURD_BLIND_REQ                       DCURD BLIND request
84          BGL_UPC_PU1_DCURD_L3_REQ                          DCURD L3 request
93          BGL_UPC_PU1_DCURD_SRAM_REQ                        DCURD SRAM request
102         BGL_UPC_PU1_PREF_REQ_VALID                        Prefetch request valid
103         BGL_UPC_PU1_PREF_SELF_HIT                         Prefetch self hit
241         BGL_UPC_TR_INJ_VC0_PYLD_ADDED                     Injection vc0 payload added
242         BGL_UPC_TR_INJ_VC1_PYLD_ADDED                     Injection vc1 payload added
250         BGL_UPC_TR_REC_VC0_PYLD_TKN                       Reception vc0 payload taken
251         BGL_UPC_TR_REC_VC1_PYLD_TKN                       Reception vc1 payload taken
261         BGL_UPC_TS_XM_PKTS                                XM packets
273         BGL_UPC_TS_XP_PKTS                                XP packets
285         BGL_UPC_TS_YM_PKTS                                YM packets
297         BGL_UPC_TS_YP_PKTS                                YP packets
309         BGL_UPC_TS_ZM_PKTS                                ZM packets
321         BGL_UPC_TS_ZP_PKTS                                ZP packets



Appendix D.

gmon support on Blue Gene/L


gmon is a GNU-based technology that allows node-specific application profiling. This appendix describes the usage of gmon support on Blue Gene/L.

Copyright IBM Corp. 2006. All rights reserved.


How to enable gmon profiling


On all platforms, three levels of profiling are available with gmon, depending on the use of the -pg and -g options on the compile and link commands. The level that is used depends on the amount of detail desired and the amount of overhead that is acceptable.

Timer tick profiling
This level of profiling provides timer tick profiling information at the machine instruction level. To enable this type of profiling, add the -pg option on the link command, but no additional options on the compile commands. This level of profiling adds the least amount of performance collection overhead.

Procedure level profiling with timer tick information
This level of profiling provides call graph information. To enable this level of profiling, include the -pg option on all compile commands and on the link command. In addition to call level profiling, you receive profiling information at the machine instruction level. This level of profiling adds additional overhead during performance data collection. When using higher levels of optimization, the entire call flow might not be viewable due to inlining and other optimizations performed by the compiler.

Full level of profiling
To enable all available profiling for a program, add the -pg and -g options to all compile commands and the link command. This level of profiling provides profiling information that can be used to create call graph information, statement level profiling, basic block profiling, and machine instruction profiling. It introduces the most overhead while collecting performance data. When higher levels of compiler optimization are used, the statement mappings and procedure calls might not appear as expected due to inlining, code movement, scheduling, and other optimizations performed by the compiler.

Additional function in Blue Gene/L gmon support


The basic gmon support is described in the man pages for the GNU toolchain. The additional function explained in the following sections is available in the Blue Gene/L toolchain.

Multiple gmon.out.x files


On Blue Gene/L, if the application runs on multiple nodes, each node generates a gmon.out.x file, where x corresponds to the rank of the node where it was run.

Enabling or disabling profiling within your application


To turn profiling on and off within your application, the application must still be compiled with the -pg (and, for full profiling, -g) options as described previously. By inserting the following calls at various points in the application, you can enable and disable profile data collection and collect data only for the significant sections of the application:

__moncontrol(1) turns on profiling.
__moncontrol(0) turns off profiling.


Collecting gmon data as a set of program counter values instead of a histogram


Performance data can be collected as a set of program counter values instead of as a histogram. To enable this type of collection, set the environment variable GMON_SAMPLE_DATA="yes" before you run your program. This setting causes the gmon data collection to consist of the set of program counters that were executing at the end of each interval instead of generating a histogram during execution. When data is collected this way, the output files are named gmon.sample.x instead of gmon.out.x. In most cases, this file is much smaller than the gmon.out.x file, and it allows you to see the sequence of execution instead of the summarized profile. The gprof tool in the Blue Gene/L toolchain has been updated to read this type of file.
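In a job script this is just an environment variable setting; the launch command itself is omitted here because it is site specific.

```shell
# Ask the gmon runtime for raw program counter samples (config fragment):
export GMON_SAMPLE_DATA="yes"
# ...then launch the -pg-compiled program as usual; each node writes
# gmon.sample.x (x = rank) instead of gmon.out.x.
```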

Enhancements to gprof in the Blue Gene/L toolchain


The following sections describe the main enhancements made to gprof for Blue Gene/L.

Using gprof to read gmon.sample.x files


The version of gprof in the Blue Gene/L toolchain has been modified to recognize and process gmon.sample.x files as described previously. When using gprof on a sample file, gprof generates the same type of report as it does for gmon.out.x files. If the -sum option is added, gprof generates a gmon.sum file that is in normal gmon.out format from the data in the gmon.sample.x file(s). The -d option displays the program counter values in the order in which they were collected.

Using gprof to merge a very large number of gmon.out.x files


The current version of gprof has a limit on the number of gmon.out.x files that can be merged in one command invocation. This is due to the Linux limit on input arguments to a command. A new option has been added to gprof to allow merging of an unlimited number of gmon.out.x files. Consider the following usage example, where pgm is the program that was profiled:
/bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gprof -sumbg pgm

This command searches the current directory for all gmon.out files of the form gmon.out.x, where x is an integer value, starting with 0, until there is a file in the sequence that cannot be found. The data in these files is summed in the same way as gprof normally does:
/bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gprof -sumbg=gmon.sample pgm

Again, pgm is the program that was profiled. As in the previous case, this command searches the current directory for all gmon.sample files of the form gmon.sample.x, where x is an integer value, starting with 0, until there is a file in the sequence that cannot be found. A gmon histogram is generated by summing the data found in each individual file, and the output goes to gmon.sum.
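The discovery rule (start at suffix 0 and stop at the first missing file) can be mimicked in plain shell, which is also a quick way to check how many of a job's output files -sumbg would actually pick up. The file names below are created only for the demonstration.

```shell
# Work in a scratch directory with a deliberately gapped sequence.
workdir=$(mktemp -d)
cd "$workdir"
touch gmon.out.0 gmon.out.1 gmon.out.2 gmon.out.5   # note the gap at 3

# Scan gmon.out.0, gmon.out.1, ... until a file in the sequence is missing.
found=""
x=0
while [ -f "gmon.out.$x" ]; do
    found="$found gmon.out.$x"
    x=$((x + 1))
done
echo "would merge:$found"    # gmon.out.5 lies beyond the gap and is skipped
```

This mirrors why a single missing rank's file silently truncates the merge: everything after the gap is ignored.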



Glossary
32b executable  Executable binaries (user applications) with 32b (4B) virtual memory addressing. Note that this is independent of the number of bytes (4 or 8) used for floating-point number representation and arithmetic.
32b floating-point arithmetic  Executable binaries (user applications) with 32b (4B) floating-point number representation and arithmetic. Note that this is independent of the number of bytes (4 or 8) used for memory reference addressing.
32b virtual memory addressing  All virtual memory addresses in a user application are 32b (4B) integers. Note that this is independent of the type of floating-point number representation and arithmetic.
64b executable  Executable binaries (user applications) with 64b (8B) virtual memory addressing. Note that this is independent of the number of bytes (4 or 8) used for floating-point number representation and arithmetic. Also, all user applications should be compiled, loaded with subcontractor-supplied libraries, and executed with 64b virtual memory addressing by default.
64b floating-point arithmetic  Executable binaries (user applications) with 64b (8B) floating-point number representation and arithmetic. Note that this is independent of the number of bytes (4 or 8) used for memory reference addressing.
64b virtual memory addressing  All virtual memory addresses in a user application are 64b (8B) integers. Note that this is independent of the type of floating-point number representation and arithmetic. Also, all user applications should be compiled, loaded with subcontractor-supplied libraries, and executed with 64b virtual memory addressing by default.
Advanced Simulation and Computing Program (ASCI)  Administered by Department of Energy (DOE)/National Nuclear Security Agency (NNSA).
API  See application programming interface.
application programming interface (API)  Defines the syntax and semantics for invoking services from within an executing application. All APIs shall be available to both Fortran and C programs, although implementation issues, such as whether the Fortran routines are simply wrappers for calling C routines, are up to the supplier.
Application Specific Integrated Circuit (ASIC)  Includes two 32-bit PowerPC cores (the 440, developed by IBM for embedded applications).
ASCI  See Advanced Simulation and Computing Program.
ASIC  See Application Specific Integrated Circuit.
BGL  See Blue Gene/L.
BGL8K  The Phase 1 build of Blue Gene/L, which contains 8192 Compute Nodes (CN), 128 I/O Nodes, one-eighth of the I/O subsystem, and all of the Front End Nodes.
BGL Compute ASIC (BLC)  This high-function Blue Gene/L ASIC is the basis of the Compute Nodes and I/O Nodes.
BGL Link (BLL) ASIC  This high-function Blue Gene/L ASIC is responsible for redriving communication signals between midplanes and is used to repartition Blue Gene/L.
bit (b)  A single, indivisible binary unit of electronic information.
BLC  See BGL Compute ASIC.
BLL  BGL Link.
Blue Gene/L (BGL)  The name given to the collection of Compute Nodes, I/O Nodes, Front End Nodes (FEN), file systems, and interconnecting networks that is the subject of this statement of work.
byte (B)  A collection of eight bits.
central processing unit (CPU) or processor  A VLSI chip that constitutes the computational core (integer, floating point, and branch units), registers, and memory interface (virtual memory translation, Translation Lookaside Buffer (TLB), and bus controller).
cluster  A set of nodes connected via a scalable network technology.
Cluster Monitoring and Control System (CMCS)
Cluster Wide File System (CWFS)  The file system that is visible from every node in the system with scalable performance.
CMCS  Cluster Monitoring and Control System.
CMN  See Control and Management Network.
CN  See Compute Node.


compute card  One of the field replaceable units (FRUs) of Blue Gene/L. Contains two complete Compute Nodes, and is plugged into a node card.
Compute Node (CN)  The element of Blue Gene/L that supplies the primary computational resource for execution of a user application.
Control and Management Network (CMN)  Provides a command and control path to Blue Gene/L for functions such as health status monitoring, repartitioning, and booting.
Core  Subcontractor delivered hardware and software. The Blue Gene/L Core consists of the Blue Gene/L Compute Main Section, Front End Node, Service Node (SN), and a control and management Ethernet.
CPU  See central processing unit.
current standard (as applied to system software and tools)  Applies when an API is not frozen on a particular version of a standard, but shall be upgraded automatically by the subcontractor as new specifications are released. For example, MPI version 2.0 refers to the standard in effect at the time of writing this document, while current version of MPI refers to further versions that take effect during the lifetime of this contract.
CWFS  See Cluster Wide File System.
DDR  See Double Data Rate.
Double Data Rate (DDR)  A technique for doubling the switching rate of a circuit by triggering on both the rising edge and falling edge of a clock signal.
EDRAM  See enhanced dynamic random access memory.
enhanced dynamic random access memory (EDRAM)  Dynamic random access memory that includes a small amount of static random access memory (SRAM) inside a larger amount of DRAM. Performance is enhanced by organizing so that many memory accesses are to the faster SRAM.
ETH  The ETH is a high-function Blue Gene/L ASIC that is responsible for Ethernet-to-JTAG conversion and other control functions.
Federated Gigabit-Ethernet Switch (FGES)  Connects the I/O Nodes of Blue Gene/L to external resources, such as the FEN and the CWFS.
FEN  See Front End Node.
FGES  See Federated Gigabit-Ethernet Switch.
Field Replaceable Unit (FRU)

Floating Point Operation (FLOP or OP)  Plural is FLOPS or OPS.
FLOP or OP  See Floating Point Operation.
FLOP/s or OP/s  Floating Point Operations per second.
Front End Node (FEN)  Is responsible, in part, for interactive access to Blue Gene/L.
FRU  Field Replaceable Unit.
fully supported (as applied to system software and tools)  Refers to product-quality implementation, documented and maintained by the HPC machine supplier or an affiliated software supplier.
GFLOP/s, GOP/s, gigaFLOP/s  A billion (10^9 = 1000000000) 64-bit floating point operations per second.
gibibyte (GiB)  A billion base 2 bytes. This is typically used in terms of RAM and is 2^30 (or 1073741824) bytes. For a complete description of SI units for prefixing binary multiples, see: http://physics.nist.gov/cuu/Units/binary.html
gigabyte (GB)  A billion base 10 bytes. This is typically used in every context except for RAM size and is 10^9 (or 1000000000) bytes.
host complex  Includes the Front End Node and Service Node.
Hot Spare Node (HSN)
HSN  Hot Spare Node.
Internet Protocol (IP)  The method by which data is sent from one computer to another on the Internet.
IP  See Internet Protocol.
job  A cluster wide abstraction similar to a POSIX session, with certain characteristics and attributes. Commands shall be available to manipulate a job as a single entity (including kill, modify, query characteristics, and query state).
input/output (I/O)  Describes any operation, program, or device that transfers data to or from a computer.
I/O card  One of the FRUs of Blue Gene/L. An I/O card contains two complete I/O Nodes and is plugged into a node card.
I/O Node (ION)  Are responsible, in part, for providing I/O services to Compute Nodes.
International Business Machines Corporation (IBM)
ION  See I/O Node.


limited availability  Represents an intermediate operational level of major computing systems at LLNL. Limited availability is characterized by system access limited to a select set of users, with reduced system functionality.
LINPACK  A collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems.
Linux  A free UNIX-like operating system originally created by Linus Torvalds with the assistance of developers around the world. Developed under the GNU General Public License, the source code for Linux is freely available to everyone.
Mean Time Between Failure (MTBF)  A measurement of the expected reliability of the system or component. The MTBF figure can be developed as the result of intensive testing, based on actual product experience, or predicted by analyzing known factors. See: http://www.t-cubed.com/faq_mtbf.htm
mebibyte (MiB)  A million base 2 bytes. This is typically used in terms of Random Access Memory and is 2^20 (or 1048576) bytes. For a complete description of SI units for prefixing binary multiples, see: http://physics.nist.gov/cuu/Units/binary.html
megabyte (MB)  A million base 10 bytes. This is typically used in every context except for RAM size and is 10^6 (or 1000000) bytes.
Message Passing Interface (MPI)
MFLOP/s, MOP/s, or megaFLOP/s  A million (10^6 = 1000000) 64-bit floating point operations per second.
midplane  An intermediate packaging component of Blue Gene/L. Multiple node cards plug into a midplane to form the basic scalable unit of Blue Gene/L.
MPI  See Message Passing Interface.
MPICH2  MPICH is an implementation of the MPI standard available from Argonne National Laboratory.
MTBF  See Mean Time Between Failure.
node  Operates under a single instance of an operating-system image and is an independent operating-system partition.
node card  An intermediate packaging component of Blue Gene/L. FRUs (compute cards and I/O cards) are plugged into a node card. Multiple node cards plug into a midplane to form the basic scalable unit of Blue Gene/L.
OCF  See Open Computing Facility.

Open Computing Facility (OCF)  The unclassified partition of Livermore Computing, the main scientific computing complex at LLNL.
OpenMP  A portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications.
peak rate  The maximum number of 64-bit floating point instructions (add, subtract, multiply or divide) per second that can conceivably be retired by the system. For RISC CPUs, the peak rate is calculated as the maximum number of floating point instructions retired per clock times the clock rate.
PTRACE  A facility that allows a parent process to control the execution of a child process. Its primary use is for the implementation of breakpoint debugging.
published (as applied to APIs)  Refers to the situation where an API is not required to be consistent across platforms. A published API refers to the fact that the API shall be documented and supported, although it may be subcontractor or platform specific.
Purple  ASCI Purple is the fourth generation of ASCI platforms.
RAID  See redundant array of independent disks.
RAM  See random access memory.
random access memory (RAM)  Computer memory in which any storage location can be accessed directly.
RAS  See reliability, availability, and serviceability.
redundant array of independent disks (RAID)  A collection of two or more physical disk drives that present to the host an image of one or more logical disk drives. In the event of a single physical device failure, the data can be read or regenerated from the other disk drives in the array due to data redundancy.
reliability, availability, and serviceability (RAS)  Include those aspects of hardware and software design and development, solution design and delivery, manufacturing quality, technical support service, and other services which contribute to assuring that the IBM offering will be available when the client wants to use it; that it will reliably perform the job; that if failures do occur, they will be nondisruptive and be repaired rapidly; and that after repair the user may resume operations with a minimum of inconvenience.
SAN  See storage area network.


scalable  A system attribute that increases in performance or size as some function of the peak rating of the system. The scaling regime of interest is at least within the range of 1 teraflop/s to 60.0 (and possibly to 120.0) teraflop/s peak rate.
SDRAM  See synchronous, dynamic random access memory.
Service Node  Is responsible, in part, for management and control of Blue Gene/L.
service representative  On-site hardware expert who performs hardware maintenance with DOE Q-clearance.
single-point control (as applied to tool interfaces)  The ability to control or acquire information about all processes or PEs using a single command or operation.
Single Program Multiple Data (SPMD)  A programming model wherein multiple instances of a single program operate on multiple data.
SMFS  See System Management File System.
SMP  See symmetric multiprocessor.
SNL  See Sandia National Laboratories.
SOW  See Statement of Work.
SPMD  See Single Program Multiple Data.
sPPM  This is a benchmark that solves a 3D gas dynamics problem on a uniform Cartesian mesh, using a simplified version of the Piecewise Parabolic Method (PPM) code.
SRAM  static random access memory.
standard (as applied to APIs)  Where an API is required to be consistent across platforms, the reference standard is named as part of the capability. The implementation shall include all routines defined by that standard, even if some simply result in no-ops on a given platform.
Statement of Work (SOW)  This document is a statement of work. A document prepared by a Project Manager (PM) as a response to a Request for Service from a client. The project SOW is the technical solution proposal, and it should describe the deliverables and identify all Global Services risks and impacts, infrastructure investments, capacity, cost elements, assumptions and dependencies.
static random access memory (SRAM)
storage area network (SAN)  A high-speed subnetwork of storage devices.

symmetric multiprocessor (SMP)  A computing node in which multiple functional units operate under the control of a single operating-system image.
synchronous, dynamic random access memory (SDRAM)  A type of dynamic random access memory (DRAM) with features that make it faster than standard DRAM.
System Management File System (SMFS)  Provides a single, central location for administrative information about Blue Gene/L.
TCP/IP  See Transmission Control Protocol/Internet Protocol.
tebibyte (TiB)  A trillion base 2 bytes. This is typically used in terms of Random Access Memory and is 2^40 (or 1099511627776) bytes. For a complete description of SI units for prefixing binary multiples, see: http://physics.nist.gov/cuu/Units/binary.html
terabyte (TB)  A trillion base 10 bytes. This is typically used in every context except for Random Access Memory size and is 10^12 (or 1000000000000) bytes.
teraflop/s (TFLOP/s)  A trillion (10^12 = 1000000000000) 64-bit floating point operations per second.
tori  The plural form of the word torus.
torus network  Each processor is directly connected to six other processors: two in the X dimension, two in the Y dimension, and two in the Z dimension. One of the easiest ways to picture a torus is to think of a 3-D cube of processors, where every processor on an edge has wraparound connections to link to other similar edge processors.
TotalView  A parallel debugger from Etnus LLC, Natick, MA.
Transmission Control Protocol/Internet Protocol (TCP/IP)  The suite of communications protocols used to connect hosts on the Internet.
Tri-Lab  Includes Los Alamos National Laboratory, Lawrence Livermore National Laboratory, and Sandia National Laboratories.
UMT2000  The UMT benchmark is a 3D, deterministic, multigroup, photon transport code for unstructured meshes.
Unified Parallel C (UPC)  A programming language with parallel extensions to ANSI C. For an example, see: http://upc.gwu.edu/


University Alliances  Members of the Academic Strategic Alliances Program (ASAP) of ASCI, academic institutions engaged in accelerating simulation science.
UPC  See Unified Parallel C.
XXX-compatible (as applied to system software and tool definitions)  Requires that a capability be compatible, at the interface level, with the referenced standard, although the lower-level implementation details will differ substantially. For example, NFSv4-compatible means that the distributed file system shall be capable of handling standard NFSv4 requests, but need not conform to NFSv4 implementation specifics.



Related publications
The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this redbook.

IBM Redbooks
For information about ordering these publications, see How to get IBM Redbooks on page 136. Note that some of the documents referenced here may be available in softcopy only.

Workload Management with LoadLeveler, SG24-6038
Linux Clustering with CSM and GPFS, SG24-6601
Blue Gene/L: Hardware Overview and Planning, SG24-6796
Blue Gene/L: System Administration, SG24-7178
Blue Gene/L: Application Development, SG24-7179

Other publications
These publications are also relevant as further information sources:

General Parallel File System (GPFS) for Clusters: Concepts, Planning, and Installation, GA22-7968
IBM General Information Manual, Installation Manual-Physical Planning, GC22-7072
LoadLeveler for AIX 5L and Linux V3.2 Using and Administering, SA22-7881

Online resources
These Web sites and URLs are also relevant as further information sources:

MPI-2 Reference
http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/mpi2-report.htm

Etnus TotalView
http://www.etnus.com/

GDB: The GNU Project Debugger


http://www.gnu.org/software/gdb/

SUSE LINUX Enterprise Server


http://www.novell.com/products/linuxenterpriseserver/

© Copyright IBM Corp. 2006. All rights reserved.


How to get IBM Redbooks

You can search for, view, or download Redbooks, Redpapers, Hints and Tips, draft publications and Additional materials, as well as order hardcopy Redbooks or CD-ROMs, at this Web site:

ibm.com/redbooks

Help from IBM

IBM Support and downloads
ibm.com/support

IBM Global Services
ibm.com/services


Index
A
ACTC (Advanced Computing Technology Center) 2
Advanced Computing Technology Center (ACTC) 2
analysis 5

B
bgl_perfctr
  usage example 66
BGLPERFDATA table 85
BGLPERFDEF table 83
  join with BGLPERFDESC 97
BGLPERFDESC table 83, 86
  join with BGLPERFDEF 97
BGLPERFINST table 82
BGLPERFJOB table 84
BGLPERFLOCATION table 84
BGLPERFSAMPLES table 85
Blue Gene/L
  tooling overview 2

C
Candygram package 43
counter definition ID 15
counter definitions 15
CPU performance 4

D
database organization 82
detailed sample type 15
detailed samples 18
dsp_perfmon command 13, 21, 54

E
electromagnetic compatibility 79
end_perfmon command 13, 43
EPIF (External Performance Instrumentation Facility) 11
EPIF command 17
  dsp_perfmon 21
  end_perfmon 43
  exp_perfmon_data 43
  ext_perfmon_data 39
  imp_perfmon_data 43
  perfmon 17
  typical uses 44
event mapping 56
exp_perfmon_data command 13, 43, 54
ext_perfmon_data command 13, 39
  options 49
External Performance Instrumentation Facility (EPIF) 8, 11
  basic concepts 13
  goals and strategies 13
  objectives 12
  options 44
  overview 12
External Performance Instrumentation Facility (Perfmon) 7

F
full level of profiling 126

G
gmon
  additional function for Blue Gene/L 126
  collecting data as program counter values 127
gmon profiling
  enablement 126
  full level 126
  procedure level with timer tick information 126
  timer tick 126
gmon.out.x file 126-127
gmon.sample.x file 127
gprof
  enhancements in Blue Gene/L toolchain 127
  merge of large number of gmon.out.x files 127
  reading of gmon.sample.x files 127

H
hardware performance monitor 4
  libraries 58
High Performance Computing Toolkit 2-3

I
I/O performance 5
IBM High Performance Computing Toolkit 2
imp_perfmon_data command 13, 16, 20, 43, 54
interval timer 12

K
kill command 19
KOJAK 3

L
libmass.a 6
libmassv.a 6
linux-bgl PAPI substrate 56

M
MASS (Mathematical Acceleration Subsystem) 6
MASS and MASSV libraries 6
Mathematical Acceleration Subsystem (MASS) 6
message passing performance 3
MIO (modular I/O) 5
MMCS
  database organization 82
modular I/O (MIO) 5
MPE/jumpshot 3
MPI profiling tools 2
MPI Tracer and Profiler 3
mx package 44

P
PAPI
  comparison with Perfmon 9
  event mapping 56
  implementation 56
  library usage examples 58
  linux-bgl substrate 56
  modifications 58
PAPI (Performance Application Programming Interface) 7-8, 55
Paraver 3
PeekPerf 3, 5
PeekView 4
Perfmon 8, 11
  comparison with PAPI 9
  database table specifications 81
perfmon command 13, 17
  advanced options 20
perfmon.pl 12
perfmon.py 12
Performance Application Programming Interface (PAPI) 7-8, 55
performance collection instance table 82
performance counters 55
performance data file table 85
performance definition table 83
performance description table 83
performance guidelines 1
performance job table 84
performance location table 84
performance samples definition table 85
performance testing 2
  KOJAK 3
  MASS and MASSV libraries 6
  modular I/O 5
  MPE/jumpshot 3
  MPI Tracer and Profiler 3
  MPI_Finalize 4
  Paraver 3
  PeekPerf 5
  PeekView 5
  TAU 3
  tools on System p 2
  tools ported to Blue Gene/L 3
  Xprofiler 5
performance tool 1
  comparison 7
procedure level profiling with timer tick information 126
profiling
  enablement and disablement in application 126
  gmon 126
program counter values 127
PyDB2 package 44
Python packages 43

R
Redbooks Web site 136
  Contact us ix
restriction 12

S
sample 14
sample interval 14, 18
sample type 15
scalar intrinsic routine 6
sets of counter definitions 97
SIGALRM 12
starting counter values 14, 18
startperfmon script 12
statement of completion 77
substrate 56
substrate interface 56
summary sample 18
summary sample type 15
System p
  tools for performance testing 2

T
TAU 3
timer tick profiling 126

V
vector intrinsic routine 6
visualization 5

X
Xprofiler 5


Back cover

Blue Gene/L: Performance Analysis Tools


Learn about Blue Gene/L performance tooling

Discover the details about PAPI and the External Performance Monitor

Understand the pros and cons of the different tools
This IBM Redbook is one in a series of IBM publications written specifically for the IBM System Blue Gene supercomputer, Blue Gene/L, which was developed by IBM in collaboration with Lawrence Livermore National Laboratory (LLNL). It provides an overview of the application development performance analysis environment for Blue Gene/L and explains some of the tools that are available for application-level performance analysis. The majority of its content is devoted to Chapter 3, "External Performance Instrumentation Facility" on page 11, and Chapter 4, "Performance Application Programming Interface" on page 55.

INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION

BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE


IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.

For more information: ibm.com/redbooks


SG24-7278-00 ISBN 0738495867
