
Debugging in Serial & Parallel

M. D. Jones, Ph.D.
Center for Computational Research University at Buffalo State University of New York

High Performance Computing I, 2012

M. D. Jones, Ph.D. (CCR/UB)

Debugging in Serial & Parallel

HPC-I Fall 2012


Part I Basic (Serial) Debugging


Introduction

Software for Debugging


The most common method for debugging (by far) is the instrumentation method:
- One instruments the code with print statements to check values and follow the execution of the program
- Not exactly sophisticated: one can certainly debug code in this way, but wise use of software debugging tools can be more effective
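As a minimal sketch of the instrumentation method (the TRACE macro and the function sum_with_trace are made up for this illustration and are not part of the lecture code), the print statements can be sent to stderr and compiled out once they are no longer needed:

```c
#include <stdio.h>

/* Hypothetical helper for illustration: compile with -DDEBUG_TRACE to
   enable the print statements, omit the flag to remove them entirely. */
#ifdef DEBUG_TRACE
#define TRACE(...) fprintf(stderr, __VA_ARGS__)
#else
#define TRACE(...) ((void)0)
#endif

/* Sum the first n elements of a, reporting each partial sum. */
int sum_with_trace(const int *a, int n)
{
    int i, total = 0;
    for (i = 0; i < n; i++) {
        total += a[i];
        TRACE("i=%d a[i]=%d total=%d\n", i, a[i], total);
    }
    return total;
}
```

Writing to stderr keeps the trace separate from the program's normal output, which helps when the output is redirected to a file.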


Debugging Tools

Debugging tools are abundant, but we will focus merely on some of the most common attributes to give you a bag of tricks that can be used when dealing with common problems.


Basic Capabilities

Common attributes:
- Divided into command-line or graphical user interfaces
- Usually you have to recompile your code (-g is almost a standard option to enable debugging) to utilize most debugger features
- Invocation by name of debugger and executable (e.g. gdb ./a.out [core])


Running Within

Inside a debugger (be it using a command-line interface (CLI) or a graphical front-end), you have some very handy abilities:
- Look at source code listing (very handy when isolating an IEEE exception)
- Line-by-line execution
- Insert stops or breakpoints at certain functional points (i.e., when critical values change)
- Ability to monitor variable values
- Look at stack trace (or backtrace) when code crashes


Command-line debugging example


Consider the following code example:

    #include <stdio.h>
    #include <stdlib.h>
    int indx;

    void initArray(int nelem_in_array, int *array);
    void printArray(int nelem_in_array, int *array);
    int squareArray(int nelem_in_array, int *array);

    int main(void) {
      const int nelem = 10;
      int *array1, *array2, *del;

      /* Allocate memory for each array */
      array1 = (int *) malloc(nelem*sizeof(int));
      array2 = (int *) malloc(nelem*sizeof(int));
      del = (int *) malloc(nelem*sizeof(int));

      /* Initialize array1 */
      initArray(nelem, array1);

      for (indx = 0; indx < nelem; indx++) {
        array1[indx] = indx + 2;
      }

      /* Print the elements of array1 */
      printf("array1 = ");
      printArray(nelem, array1);

      /* Copy array1 to array2 */
      array2 = array1;

      /* Pass array2 to the function squareArray() */
      squareArray(nelem, array2);

      /* Compute difference between elements of array2 and array1 */
      for (indx = 0; indx < nelem; indx++) {
        del[indx] = array2[indx] - array1[indx];
      }

      /* Print the computed differences */
      printf("The difference in the elements of array2 and array1 are: ");
      printArray(nelem, del);
      free(array1);
      free(array2);
      free(del);
      return 0;
    }

    void initArray(const int nelem_in_array, int *array) {
      for (indx = 0; indx < nelem_in_array; indx++) {
        array[indx] = indx + 1;
      }
    }

    int squareArray(const int nelem_in_array, int *array) {
      int indx;

      for (indx = 0; indx < nelem_in_array; indx++) {
        array[indx] *= array[indx];
      }
      return *array;
    }

    void printArray(const int nelem_in_array, int *array) {
      printf("\n( ");
      for (indx = 0; indx < nelem_in_array; indx++) {
        printf("%d ", array[indx]);
      }
      printf(")\n");
    }


Ok, now let's compile and run this code:

    [bono:~/d_debug]$ gcc -g -o arrayex arrayex.c
    [bono:~/d_debug]$ ./arrayex
    array1 =
    ( 2 3 4 5 6 7 8 9 10 11 )
    The difference in the elements of array2 and array1 are:
    ( 0 0 0 0 0 0 0 0 0 0 )
    *** glibc detected *** double free or corruption (fasttop): 0x0000000000501010 ***
    Aborted

Not exactly what we expect, is it? Array2 should contain the squares of the values in array1, and therefore the difference should be i^2 - i for i = [2, 11].
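Spelled out, the expected differences (assuming squareArray had been given its own copy of the data) are:

```latex
\[
  \mathrm{del}_i = i^2 - i = i\,(i-1), \qquad i = 2, 3, \ldots, 11
\]
\[
  \mathrm{del} = (2,\ 6,\ 12,\ 20,\ 30,\ 42,\ 56,\ 72,\ 90,\ 110)
\]
```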


Now let us run the code from within gdb. Our goal is to set a breakpoint where the squared array elements are computed, then step through the code:

    [bono:~/d_debug]$ gdb arrayex
    (gdb) l 34
    31
    32  /* Copy array1 to array2 */
    33  array2 = array1;
    34
    35  /* Pass array2 to the function squareArray() */
    36  squareArray(nelem, array2);
    37
    38  /* Compute difference between elements of array2 and array1 */
    39  for (indx = 0; indx < nelem; indx++) {
    40    del[indx] = array2[indx] - array1[indx];
    (gdb) b 34
    Breakpoint 1 at 0x400604: file ex1.c, line 34.
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/arrayex
    array1 =
    ( 2 3 4 5 6 7 8 9 10 11 )

    Breakpoint 1, main () at arrayex.c:34
    34  squareArray(nelem, array2);
    (gdb) s
    squareArray (nelem_in_array=10, array=0x501010) at arrayex.c:59
    59  for (indx = 0; indx < nelem_in_array; indx++) {
    (gdb) s
    60    array[indx] *= array[indx];
    (gdb) s
    59  for (indx = 0; indx < nelem_in_array; indx++) {
    (gdb) p indx
    $1 = 0
    (gdb) p array[indx]
    $2 = 4
    (gdb) display indx
    1: indx = 0
    (gdb) display array[indx]
    2: array[indx] = 4
    (gdb) s
    60    array[indx] *= array[indx];
    2: array[indx] = 3
    1: indx = 1

Ok, that is instructive, but no closer to finding the bug.


So, what have we learned so far about the command-line debugger:
- Useful for peeking inside source code
- (break) Breakpoints
- (s) Stepping through execution
- (p) Print values at selected points (can also use handy printf syntax as in C)
- (display) Displaying values for monitoring while stepping through code
- (bt) Backtrace, or stack trace - haven't used this yet, but certainly will


Digging Out the Bug


What we have learned is enough - look more closely at the line where the differences between array1 and array2 are computed:

    (gdb) l 38
    33  /* Pass array2 to the function squareArray() */
    34  squareArray(nelem, array2);
    35
    36  /* Compute difference between elements of array2 and array1 */
    37  for (indx = 0; indx < nelem; indx++) {
    38    del[indx] = array2[indx] - array1[indx];
    39  }
    40
    41  /* Print the computed differences */
    42  printf("The difference in the elements of array2 and array1 are: ");
    (gdb) b 37
    Breakpoint 1 at 0x400611: file arrayex.c, line 37.
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/arrayex
    array1 =
    ( 2 3 4 5 6 7 8 9 10 11 )

    Breakpoint 1, main () at arrayex.c:37
    37  for (indx = 0; indx < nelem; indx++) {
    (gdb) disp indx
    1: indx = 10
    (gdb) disp array1[indx]
    2: array1[indx] = 49
    (gdb) disp array2[indx]
    3: array2[indx] = 49
    (gdb) s
    38    del[indx] = array2[indx] - array1[indx];
    3: array2[indx] = 4
    2: array1[indx] = 4
    1: indx = 0
    (gdb) s
    37  for (indx = 0; indx < nelem; indx++) {
    3: array2[indx] = 4
    2: array1[indx] = 4
    1: indx = 0
    (gdb) s
    38    del[indx] = array2[indx] - array1[indx];
    3: array2[indx] = 9
    2: array1[indx] = 9
    1: indx = 1


Now that isn't right - array1 was not supposed to change. Let us go back and look more closely at the call to squareArray ...


    (gdb) l
    32
    33  /* Pass array2 to the function squareArray() */
    34  squareArray(nelem, array2);
    35
    36  /* Compute difference between elements of array2 and array1 */
    37  for (indx = 0; indx < nelem; indx++) {
    38    del[indx] = array2[indx] - array1[indx];
    39  }
    40
    41  /* Print the computed differences */
    (gdb) b 34
    Breakpoint 2 at 0x400605: file arrayex.c, line 34.
    (gdb) run
    The program being debugged has been started already.
    Start it from the beginning? (y or n) y
    Starting program: /san/user/jonesm/u2/d_debug/arrayex
    array1 =
    ( 2 3 4 5 6 7 8 9 10 11 )

    Breakpoint 2, main () at arrayex.c:34
    34  squareArray(nelem, array2);
    3: array2[indx] = 49
    2: array1[indx] = 49
    1: indx = 10
    (gdb) disp array2
    4: array2 = (int *) 0x501010
    (gdb) disp array1
    5: array1 = (int *) 0x501010


Yikes, array1 and array2 point to the same memory location! See, pointer errors like this don't happen too often in Fortran ... Now, of course, the bug is obvious - but aren't they all obvious after you find them?
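A minimal sketch of why this matters (the function names below are illustrative, not from the lecture code): assigning one pointer to another copies only the address. Both names then refer to one block, the block originally allocated for array2 is leaked, and calling free() on both names releases the same block twice, which is what glibc complained about. Copying the elements keeps the two arrays independent:

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the two pointers refer to the same block of memory. */
int same_block(const int *a, const int *b)
{
    return a == b;
}

/* Allocate a new array and copy n elements of src into it.
   Returns NULL if the allocation fails; the caller frees the result. */
int *copy_array(const int *src, size_t n)
{
    int *dst = malloc(n * sizeof *dst);
    if (dst != NULL) {
        memcpy(dst, src, n * sizeof *dst);
    }
    return dst;
}
```

With a real copy, squaring the elements of the copy leaves the source untouched, and each block is freed exactly once.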


The Fix Is In
Just as an afterthought, what we ought to have done in the first place was copy array1 into array2:

    /* Copy array1 to array2 */
    /* array2 = array1; */
    for (indx=0; indx<nelem; indx++) {
      array2[indx]=array1[indx];
    }

which will finally produce the right output:

    (gdb) run
    Starting program: /home/jonesm/d_debug/ex1
    array1 =
    ( 2 3 4 5 6 7 8 9 10 11 )
    The difference in the elements of array2 and array1 are:
    ( 2 6 12 20 30 42 56 72 90 110 )

    Program exited normally.
    (gdb)


Array Indexing Errors


Array indexing errors are one of the most common errors in both sequential and parallel codes - and it is not entirely surprising:
- Different languages have different indexing defaults
- Multi-dimensional arrays are pretty easy to reference out-of-bounds
- Fortran in particular lets you use very complex indexing schemes (essentially arbitrary!)
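One defensive habit, sketched here under the assumption of a row-major, 0-based C layout (the helper name idx2d is hypothetical), is to funnel every multi-dimensional reference through a single function that checks the bounds:

```c
#include <assert.h>

/* Map (i, j) in an ni-by-nj array stored row-major in one flat block
   to its offset; abort via assert() if either index is out of range. */
int idx2d(int i, int j, int ni, int nj)
{
    assert(i >= 0 && i < ni);
    assert(j >= 0 && j < nj);
    return i * nj + j;
}
```

An out-of-range index then stops the program at the faulty reference instead of silently reading a neighbouring element. Defining NDEBUG at compile time removes the checks once the code is trusted.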


Example: Indexing Error


    #include <stdio.h>
    #define N 10
    int main(int argc, char *argv[]) {
      int arr[N];
      int i, odd_sum, even_sum;

      for (i=1; i<(N-1); ++i) {
        if (i<=4) {
          arr[i] = (i*i)%3;
        } else {
          arr[i] = (i*i)%5;
        }
      }
      odd_sum=0;
      even_sum=0;
      for (i=0; i<(N-1); ++i) {
        if (i%2==0) {
          even_sum += arr[i];
        } else {
          odd_sum += arr[i];
        }
      }
      printf("odd_sum=%d, even_sum=%d\n", odd_sum, even_sum);
    }


Now, try compiling with gcc and running the code:

    [bono:~/d_debug]$ gcc -g -o ex2 ex2.c -O
    [bono:~/d_debug]$ ./ex2
    odd_sum=5, even_sum=671173703

Ok, that hardly seems reasonable (does it?) Now, let's run this example from within gdb and set a breakpoint to examine the accumulation of values to even_sum.


    (gdb) l 16
    11        arr[i] = (i*i)%5;
    12      }
    13    }
    14    odd_sum=0;
    15    even_sum=0;
    16    for (i=0; i<(N-1); ++i) {
    17      if (i%2==0) {
    18        even_sum += arr[i];
    19      } else {
    20        odd_sum += arr[i];
    (gdb) b 16
    Breakpoint 1 at 0x40051e: file ex2.c, line 16.
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/ex2

    Breakpoint 1, main (argc=Variable "argc" is not available.
    ) at ex2.c:16
    16    for (i=0; i<(N-1); ++i) {
    (gdb) p arr
    $1 = {671173696, 1, 1, 0, 1, 0, 1, 4, 4, 0}


So we see that our original example code missed initializing the first element of the array, and the results were rather erratic (in fact they will likely be compiler and flag dependent). Initialization is just one aspect of things going wrong with array indexing - let us examine another common problem ...
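A sketch of one way to remove the ambiguity, assuming the intent was to fill and sum all N elements (the slides do not say, so the loop bounds below are an assumption, and the function names are made up): start both loops at index 0 and run them over the same range. Note that in the original, arr[9] is never set either; the summation loop just happens to skip it.

```c
#define N 10

/* Fill every element, including arr[0] and arr[n-1]. */
void fill(int *arr, int n)
{
    int i;
    for (i = 0; i < n; ++i) {
        arr[i] = (i <= 4) ? (i * i) % 3 : (i * i) % 5;
    }
}

/* Accumulate elements at odd and even indices over the same range. */
void parity_sums(const int *arr, int n, int *odd_sum, int *even_sum)
{
    int i;
    *odd_sum = 0;
    *even_sum = 0;
    for (i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            *even_sum += arr[i];
        } else {
            *odd_sum += arr[i];
        }
    }
}
```

Passing the length n to both functions makes it harder for the fill range and the sum range to drift apart.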


The (Infamous) Seg Fault

This example I borrowed from Norman Matloff (UC Davis), who has a nice article (well worth the time to read): "Guide to Faster, Less Frustrating Debugging", which you can find easily enough on the web: http://heather.cs.ucdavis.edu/~matloff/unix.html


Main code: findprimes.c

    /* prime-number finding program

       will (after bugs are fixed) report a list of all primes which are
       less than or equal to the user-supplied upper bound */

    #include <stdio.h>

    #define MaxPrimes 50

    int Prime[MaxPrimes],  /* Prime[I] will be 1 if I is prime, 0 otherwise */
        UpperBound;        /* we will check up through UpperBound for primeness */

    void CheckPrime(int K); /* prototype for CheckPrime function */

    int main()
    {
       int N;

       printf("enter upper bound\n");
       scanf("%d",UpperBound);

       Prime[2] = 1;

       for (N = 3; N <= UpperBound; N += 2)
          CheckPrime(N);
          if (Prime[N]) printf("%d is a prime\n",N);
    }


Function CheckPrime:

    void CheckPrime(int K) {
       int J;

       /* the plan: see if J divides K, for all values J which are

             (a) themselves prime (no need to try J if it is nonprime), and
             (b) less than or equal to sqrt(K) (if K has a divisor larger
                 than this square root, it must also have a smaller one,
                 so no need to check for larger ones) */

       J = 2;
       while (1) {
          if (Prime[J] == 1)
             if (K % J == 0) {
                Prime[K] = 0;
                return;
             }
          J++;
       }

       /* if we get here, then there were no divisors of K, so it is prime */
       Prime[K] = 1;
    }

So now if we compile and run this code ...



    [bono:~/d_debug]$ gcc -g -o findprimes findprimes.c
    [bono:~/d_debug]$ ./findprimes
    enter upper bound
    20
    Segmentation fault

Ok, let's fire up gdb and see where this code crashed:

    [bono:~/d_debug]$ gdb findprimes
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/findprimes
    enter upper bound
    20

    Program received signal SIGSEGV, Segmentation fault.
    0x000000392815062f in _IO_vfscanf_internal () from /lib64/tls/libc.so.6
    (gdb) bt
    #0  0x000000392815062f in _IO_vfscanf_internal () from /lib64/tls/libc.so.6
    #1  0x000000392815866a in scanf () from /lib64/tls/libc.so.6
    #2  0x0000000000400524 in main () at findprimes.c:16
    (gdb)


Now, the scanf library routine is probably pretty safe from internal bugs, so the error is likely coming from our usage:

    (gdb) l
    16     scanf("%d",UpperBound);
    17
    18     Prime[2] = 1;
    19
    20     for (N = 3; N <= UpperBound; N += 2)
    21        CheckPrime(N);
    22        if (Prime[N]) printf("%d is a prime\n",N);
    23  }
    24
    25  void CheckPrime(int K) {

Yeah, pretty dumb - scanf needs a pointer argument, i.e.

    (gdb) l
    16     scanf("%d",&UpperBound);
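As an aside, a sketch of a slightly more defensive read (the function name read_bound is hypothetical, and sscanf on a string is used instead of scanf so the example is easy to test): pass the address of the variable and check the return value, which counts the items successfully converted.

```c
#include <stdio.h>

/* Parse one integer from text into *bound.
   Returns 1 on success, 0 if no integer could be converted. */
int read_bound(const char *text, int *bound)
{
    return sscanf(text, "%d", bound) == 1;
}
```

Compiling with gcc -Wall should also flag the original call, since the format checker sees an int argument where %d expects an int *.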

That takes care of the first bug ... but let's keep running from within gdb


    [bono:~/d_debug]$ gcc -g -o findprimes findprimes.c
    [bono:~/d_debug]$ gdb findprimes
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/findprimes
    enter upper bound
    20

    Program received signal SIGSEGV, Segmentation fault.
    0x0000000000400586 in CheckPrime (K=3) at findprimes.c:37
    37        if (Prime[J] == 1)
    (gdb) bt
    #0  0x0000000000400586 in CheckPrime (K=3) at findprimes.c:37
    #1  0x0000000000400547 in main () at findprimes.c:21
    (gdb) l 37
    32               than this square root, it must also have a smaller one,
    33               so no need to check for larger ones) */
    34
    35     J = 2;
    36     while (1) {
    37        if (Prime[J] == 1)
    38           if (K % J == 0) {
    39              Prime[K] = 0;
    40              return;
    41           }
    (gdb)


Very often we get seg faults on trying to reference an array out-of-bounds, so have a look at the value of J:

    (gdb) l 37
    32               than this square root, it must also have a smaller one,
    33               so no need to check for larger ones) */
    34
    35     J = 2;
    36     while (1) {
    37        if (Prime[J] == 1)
    38           if (K % J == 0) {
    39              Prime[K] = 0;
    40              return;
    41           }
    (gdb) p J
    $1 = 376

Oops! That is just a tad outside the bounds (50). Kind of forgot to put a cap on the value of J ...


Fixing the last bug:

    (gdb) l 40
    35     J = 2;
    36     /* while (1) { */
    37     for (J=2; J*J <= K; J++) {
    38        if (Prime[J] == 1)
    39           if (K % J == 0) {
    40              Prime[K] = 0;
    41              return;
    42           }
    43        /* J++; */
    44     }

Ok, now let us try to run the code:

    [bono:~/d_debug]$ gcc -g -o findprimes findprimes.c
    [bono:~/d_debug]$ ./findprimes
    enter upper bound
    20
    [bono:~/d_debug]$

Oh, fantastic - no primes between 1 and 20? Not hardly ...


Ok, so now we will set a couple of breakpoints - one at the call to CheckPrime and the second where a successful prime is to be output:


    (gdb) l
    16     scanf("%d",&UpperBound);
    17
    18     Prime[2] = 1;
    19
    20     for (N = 3; N <= UpperBound; N += 2)
    21        CheckPrime(N);
    22        if (Prime[N]) printf("%d is a prime\n",N);
    23  }
    24
    25  void CheckPrime(int K) {
    (gdb) b 20
    Breakpoint 1 at 0x40052d: file findprimes.c, line 20.
    (gdb) b 22
    Breakpoint 2 at 0x400550: file findprimes.c, line 22.
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/findprimes
    enter upper bound
    20

    Breakpoint 1, main () at findprimes.c:20
    20     for (N = 3; N <= UpperBound; N += 2)
    (gdb) c
    Continuing.

    Breakpoint 2, main () at findprimes.c:22
    22        if (Prime[N]) printf("%d is a prime\n",N);
    (gdb) p N
    $1 = 21
    (gdb)


Another gotcha - misplaced (or no) braces. Fix that:

    (gdb) l
    16     scanf("%d",&UpperBound);
    17
    18     Prime[2] = 1;
    19
    20     for (N = 3; N <= UpperBound; N += 2) {
    21        CheckPrime(N);
    22        if (Prime[N]) printf("%d is a prime\n",N);
    23     }
    24  }
    25
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/findprimes
    enter upper bound
    20
    3 is a prime
    5 is a prime
    7 is a prime
    11 is a prime
    13 is a prime
    17 is a prime
    19 is a prime

    Program exited with code 025.
    (gdb)

Ah, the sweet taste of success ... (even better, give the program a return code!)
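Pulling the fixes together, a sketch of the corrected logic (the function names and the explicit table-size check are additions for illustration, not part of Matloff's code; the check guards against an upper bound larger than the 50-element table, which the original does not test for):

```c
#define MAX_PRIMES 50

/* Mark prime[k] as 1 if k is prime, 0 otherwise, trying only divisors
   j with j*j <= k that are themselves already marked prime.
   Returns -1 without touching the table if k does not fit in it. */
int check_prime(int *prime, int k)
{
    int j;
    if (k < 2 || k >= MAX_PRIMES) {
        return -1;
    }
    for (j = 2; j * j <= k; j++) {
        if (prime[j] == 1 && k % j == 0) {
            prime[k] = 0;
            return 0;
        }
    }
    prime[k] = 1;
    return 0;
}

/* Fill a zero-initialized table for 2 and the odd numbers up to
   upper_bound; returns how many primes were found. */
int find_primes(int *prime, int upper_bound)
{
    int n, count;
    if (upper_bound < 2) {
        return 0;
    }
    prime[2] = 1;
    count = 1;
    for (n = 3; n <= upper_bound; n += 2) {   /* braces make the loop body explicit */
        if (check_prime(prime, n) == 0 && prime[n]) {
            count++;
        }
    }
    return count;
}
```

The table must hold MAX_PRIMES entries and start out as all zeros, as the global array in the original does.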

Debugging Life Itself

Game of Life


Well, ok, not exactly debugging life itself; rather the game of life. Mathematician John Horton Conway's game of life [1], to be exact. This example will basically be similar to the prior examples, but now we will work in Fortran, and debug some integer arithmetic errors. And the context will be slightly more interesting.

[1] See, for example, Martin Gardner's article in Scientific American, 223, pp. 120-123 (1970).


Game of Life

The Game of Life is one of the better known examples of cellular automatons (CA), namely a discrete model with a finite number of states, often used in theoretical biology, game theory, etc. The rules are actually pretty simple, and can lead to some rather surprising self-organizing behavior. The universe in the game of life:
- Universe is an infinite 2D grid of cells, each of which is alive or dead
- Cells interact only with nearest neighbors (including on the diagonals, which makes for eight neighbors)


Rules of Life

The rules in the game of life:
- Any live cell with fewer than two live neighbours dies, as if by loneliness
- Any live cell with more than three live neighbours dies, as if by overcrowding
- Any live cell with two or three live neighbours lives, unchanged, to the next generation
- Any dead cell with exactly three live neighbours comes to life
An initial pattern is evolved by simultaneously applying the above rules to the entire grid, and subsequently at each tick of the clock.
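The four rules collapse into a small function of the cell's current state and its live-neighbour count. A minimal sketch (the function name next_state is illustrative):

```c
/* Next state of one cell: alive is 0 or 1, live_neighbours is 0..8. */
int next_state(int alive, int live_neighbours)
{
    if (live_neighbours == 3) {
        return 1;            /* birth, or survival with three */
    }
    if (live_neighbours == 2) {
        return alive;        /* unchanged with two */
    }
    return 0;                /* loneliness or overcrowding */
}
```

Because the rules are applied simultaneously, the neighbour counts must come from the old grid while results go into a separate new grid.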


Sample Code - Game of Life

    program life
      !
      ! Conway game of life (debugging example)
      !
      implicit none
      integer, parameter :: ni=1000, nj=1000, nsteps = 100
      integer :: i, j, n, im, ip, jm, jp, nsum, isum
      integer, dimension(0:ni,0:nj) :: old, new
      real :: arand, nim2, njm2
      !
      ! initialize elements of "old" to 0 or 1
      !
      do j = 1, nj
         do i = 1, ni
            CALL random_number(arand)
            old(i,j) = NINT(arand)
         enddo
      enddo
      nim2 = ni - 2
      njm2 = nj - 2
      !
      ! time iteration
      !
      time_iteration: do n = 1, nsteps
         do j = 1, nj
            do i = 1, ni
               !
               ! periodic boundaries,
               !
               im = 1 + (i+nim2) - ((i+nim2)/ni)*ni   ! if i=1, ni
               ip = 1 + i - (i/ni)*ni                 ! if i=ni, 1
               jm = 1 + (j+njm2) - ((j+njm2)/nj)*nj   ! if j=1, nj
               jp = 1 + j - (j/nj)*nj                 ! if j=nj, 1
               !
               ! for each point, add surrounding values
               !
               nsum = old(im,jp) + old(i,jp) + old(ip,jp) + &
                      old(im,j )             + old(ip,j ) + &
                      old(im,jm) + old(i,jm) + old(ip,jm)
               !
               ! set new value based on number of "live" neighbors
               !
               select case (nsum)
               case (3)
                  new(i,j) = 1
               case (2)
                  new(i,j) = old(i,j)
               case default
                  new(i,j) = 0
               end select
            enddo
         enddo
         !
         ! copy new state into old state
         !
         old = new
         print*, 'Tick ',n,' number of living: ',sum(new)
      enddo time_iteration
      !
      ! write number of live points
      !
      print*, 'number of live points = ', sum(new)
    end program life


Initial Run ...

    [bono:~/d_debug]$ ifort -g -o life life.f90
    [bono:~/d_debug]$ ./life
     Tick 1 number of living: 342946
     Tick 2 number of living: 334381
     Tick 3 number of living: 291022
     Tick 4 number of living: 263356
     Tick 5 number of living: 290940
     Tick 6 number of living: 322733
     :
     :
     Tick 99 number of living: 0
     Tick 100 number of living: 0
     number of live points = 0

Hmm, everybody dies! What kind of life is that? ... well, not a correct one, in this context, at least. Undoubtedly the problem lies within the neighbor calculation, so let us take a closer look at the execution ...


    (gdb) l 30
    25       do j = 1, nj
    26          do i = 1, ni
    27             !
    28             ! periodic boundaries
    29             !
    30             im = 1 + (i+nim2) - ((i+nim2)/ni)*ni   ! if i=1, ni
    31             ip = 1 + i - (i/ni)*ni                 ! if i=ni, 1
    32             jm = 1 + (j+njm2) - ((j+njm2)/nj)*nj   ! if j=1, nj
    33             jp = 1 + j - (j/nj)*nj                 ! if j=nj, 1
    (gdb) b 25
    Breakpoint 1 at 0x402e23: file life.f90, line 25.
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/life

    Breakpoint 1, life () at life.f90:25
    25       do j = 1, nj
    Current language: auto; currently fortran
    (gdb) s
    26          do i = 1, ni
    (gdb) s
    30             im = 1 + (i+nim2) - ((i+nim2)/ni)*ni   ! if i=1, ni
    (gdb) s
    31             ip = 1 + i - (i/ni)*ni                 ! if i=ni, 1
    (gdb) print im
    $1 = 1
    (gdb) print (i+nim2)/1000
    $2 = 0.999


Ok, so therein lay the problem - nim2 and njm2 should be integers, not real values. With real operands, (i+nim2)/ni evaluates to 0.999 instead of truncating to 0, so for i=1 we get im = 1 rather than ni. Fix that:

    program life
      !
      ! Conway game of life (debugging example)
      !
      implicit none
      integer, parameter :: ni=1000, nj=1000, nsteps = 100
      integer :: i, j, n, im, ip, jm, jp, nsum, isum, nim2, njm2
      integer, dimension(0:ni,0:nj) :: old, new
      real :: arand

and things become a bit more reasonable:

    [bono:~/d_debug]$ ifort -g -o life life.f90
    [bono:~/d_debug]$ ./life
     Tick 1 number of living: 272990
     Tick 2 number of living: 253690
     :
     :
     Tick 99 number of living: 95073
     Tick 100 number of living: 94664
     number of live points = 94664
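The same periodic wrap can be sketched in C with the % operator, which makes the reliance on integer arithmetic explicit (the function names are made up; indices are 1-based to match the Fortran code):

```c
/* Left/lower neighbour of i on a periodic 1..n grid: 1 wraps to n. */
int wrap_minus(int i, int n)
{
    return 1 + (i + n - 2) % n;
}

/* Right/upper neighbour of i on a periodic 1..n grid: n wraps to 1. */
int wrap_plus(int i, int n)
{
    return 1 + i % n;
}
```

Because both operands are int, the remainder is computed with truncating integer division, which is exactly what the Fortran expression relied on before nim2 and njm2 were mistakenly declared real.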


Diversion - Demo life

http://www.radicaleye.com/lifepage
http://en.wikipedia.org/wiki/Conway's_Game_of_Life
Interesting repository of Conway's life and cellular automata references.


Other Debugging Miscellany

Core Files


Core files can also be used to instantly analyze problems that caused a code failure bad enough to dump a core file. Often the computer system has been set up in such a way that the default is not to output core files, however:

    [bono:~/d_debug]$ ulimit -c
    0

for bash syntax. In tcsh you would use the limit built-in command to query (or set) the coredumpsize value:

    [bono:~/d_debug]$ tcsh
    [jonesm@bono ~/d_debug]$ limit coredumpsize
    coredumpsize    unlimited


Systems administrators set the core file size limit to zero by default for a good reason - these files generally contain the entire memory image of an application process when it dies, and that can be very large. End-users are also notoriously bad about leaving these files lying around ... Having said that, we can up the limit, and produce a core file that can later be used for analysis.
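Besides the shell built-ins, a program can raise its own limit at startup. A minimal sketch using the POSIX getrlimit/setrlimit calls (the wrapper name raise_core_limit is hypothetical); it lifts the soft core-size limit to whatever hard limit the administrator allows:

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
   Returns 0 on success, -1 on failure (errno is set by the failing call). */
int raise_core_limit(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return -1;
    }
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim);
}
```

If the hard limit itself is zero, no core file will be written regardless, and only the administrator can change that.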


Core File Example

Ok, so now we can use one of our previous examples, and generate a core file:

    [bono:~/d_debug]$ gcc -g -o findprimes_orig findprimes_orig.c
    [bono:~/d_debug]$ ./findprimes_orig
    enter upper bound
    20
    Segmentation fault (core dumped)
    [bono:~/d_debug]$ ls -l core.7428
    -rw------- 1 jonesm ccrstaff 65536 Sep 27 12:15 core.7428

This particular core file is not at all large (it is a very simple code, though, with very little stored data - generally the core file size will reflect the size of the application in terms of its memory use when it crashed). Analyzing it is pretty much like we did when running this example live in gdb:
    [bono:~/d_debug]$ ls -l core.7428
    -rw------- 1 jonesm ccrstaff 65536 Sep 27 12:15 core.7428
    [bono:~/d_debug]$ gdb findprimes_orig core.7428
    GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
    ...
    Core was generated by `./findprimes_orig'.
    Program terminated with signal 11, Segmentation fault.
    Reading symbols from /lib64/tls/libc.so.6...done.
    Loaded symbols for /lib64/tls/libc.so.6
    Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
    Loaded symbols for /lib64/ld-linux-x86-64.so.2
    #0  0x000000392815062f in _IO_vfscanf_internal () from /lib64/tls/libc.so.6
    (gdb) bt
    #0  0x000000392815062f in _IO_vfscanf_internal () from /lib64/tls/libc.so.6
    #1  0x000000392815866a in scanf () from /lib64/tls/libc.so.6
    #2  0x0000000000400524 in main () at findprimes_orig.c:16
    (gdb) l 16
    11 int main()
    12 {
    13 int N;
    14
    15 printf("enter upper bound\n");
    16 scanf("%d", UpperBound);

Summary on Core Files

So why would you want to use a core file rather than interactively debug?
- Your bug may take quite a while to manifest itself
- You have to debug inside a batch queuing system where interactive use is difficult or curtailed
- You want to capture a picture of the code state when it crashes
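
For the batch-queue case, a rough sketch of a post-mortem step that could go at the end of a job script. The helper names newest_core and post_mortem are hypothetical, and the core.* pattern assumes core files are named core.PID as in the example above (naming is system dependent). gdb's -batch and -ex options run the given command and exit without an interactive session.

```shell
# newest_core DIR: print the most recently modified core.* file in DIR,
# or nothing when there is none.
newest_core() {
    ls -t "$1"/core.* 2>/dev/null | head -n 1
}

# post_mortem EXE DIR: print a backtrace from the newest core file in DIR
# without starting an interactive debugger session.
post_mortem() {
    core=$(newest_core "$2")
    if [ -z "$core" ]; then
        echo "no core file found in $2"
        return 1
    fi
    gdb -batch -ex bt "$1" "$core"
}
```

In a job script this might be called as post_mortem ./findprimes_orig . right after the program runs, so the backtrace ends up in the job's output file.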

More Command-line Debugging Tools

We focused on gdb, but there are command-line debuggers that accompany just about every available compiler product:
- pgdbg: part of the PGI compiler suite, defaults to a GUI, but can be run as a command-line interface (CLI) using the -text option
- idb: part of the Intel compiler suite, defaults to CLI (has a special option, -gdb, for using gdb command syntax)

Run-time Compiler Checks

Most compilers support run-time checks that can quickly catch common bugs. Here is a handy short-list (contributions welcome!):
- For Intel Fortran, -check bounds -traceback -g will automate bounds checking, and enable extensive traceback analysis in case of a crash (leave out the bounds option to get a crash report on any IEEE exception, format mismatch, etc.)
- For PGI compilers, -Mbounds -g will do bounds checking
- For GNU compilers, -fbounds-check -g should also do bounds checking, but is only currently supported for the Fortran and Java front-ends.
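
The short-list above can be kept in a small lookup function, sketched here so that a build script does not have to remember each vendor's spelling. The function name and the driver names (ifort, pgf90, gfortran) are assumptions for illustration; the flags are the ones listed above. Newer gfortran releases spell the bounds option -fcheck=bounds.

```shell
# bounds_check_flags COMPILER: print the run-time checking flags from the
# short-list above for a given compiler driver (driver names are assumed).
bounds_check_flags() {
    case "$1" in
        ifort)    echo "-check bounds -traceback -g" ;;
        pgf90)    echo "-Mbounds -g" ;;
        gfortran) echo "-fbounds-check -g" ;;
        *)        echo "unknown compiler: $1" >&2; return 1 ;;
    esac
}
```

A debug build could then look like ifort $(bounds_check_flags ifort) -o mycode mycode.f90, where mycode.f90 is a placeholder file name.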

WARNING: run-time error checking can slow down a code's execution considerably, so it is not something that you will want to use all of the time.

Serial Debugging GUIs

There is, of course, a matching set of GUIs for the various debuggers. A short list:
- ddd: a graphical front-end for the venerable gdb
- pgdbg: GUI for the PGI debugger
- idb -gui: GUI for the Intel compiler suite debugger
It is very much a matter of preference whether or not to use the GUI. I find the GUI to be constraining, but it does make navigation easier.

DDD Example
Running one of our previous examples using ddd ...

More Information on Debuggers

More information on the tools that we have used/mentioned (man pages are also a good place to start):
- gdb User Manual: http://sources.redhat.com/gdb/current/onlinedocs/gdb_toc.html
- ddd User Guide: http://www.gnu.org/manual/ddd/pdf/ddd.pdf
- idb Manual: http://www.intel.com/software/products/compilers/docs/linux/idb_manual_l.html
- pgdbg Guide (locally on CCR systems): file:///util/pgi/linux86-64/6.2/doc/index.htm

Source Code Checking Tools

Now, in a completely different vein, there are tools designed to help identify errors before compilation, namely by analyzing the source code itself.
- splint is a tool for statically checking C programs: http://www.splint.org
- ftnchek is a tool that checks only (alas) FORTRAN 77 codes: http://www.dsm.fordham.edu/~ftnchek/
I can't say that I have found these to be particularly helpful, though.

Memory Allocation Tools

Memory allocation problems are very common - there are some tools designed to help you catch such errors at run-time:
- efence, or Electric Fence, tries to trap any out-of-bounds references (see man efence)
- valgrind is a suite of tools for analyzing and profiling binaries (see man valgrind) - there is a user manual available at: file:///usr/share/doc/valgrind-3.6.0/html/manual.html

I have seen valgrind used with good success, but not particularly in the HPC arena.
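
A minimal sketch of running a program under valgrind's default memcheck tool. The wrapper name run_memcheck is hypothetical; --leak-check=full is a standard memcheck option that reports each leaked block along with the call stack where it was allocated.

```shell
# run_memcheck PROGRAM [ARGS...]: run a program under valgrind's default
# memcheck tool, or say why that is not possible.
run_memcheck() {
    if ! command -v valgrind >/dev/null 2>&1; then
        echo "valgrind not found in PATH" >&2
        return 127
    fi
    # --leak-check=full lists every leaked block with its allocation stack
    valgrind --leak-check=full "$@"
}
```

For example, run_memcheck ./a.out would run a binary compiled with -g, so that the reports can name source files and line numbers. Expect the run to be much slower than normal.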

Strace

strace is a powerful low-level tool that will allow you to trace all system calls and signals made by a particular binary, whether or not you have source code. It can be attached to already running processes. You can learn a lot from it, but it is often a tool of last resort for user applications in HPC due to the copious quantity of extraneous information it outputs.

Strace Example
As an example of using strace, let's peek in on a running MPI process (part of a 32 task job on U2):
    [c06n15:~]$ ps -u jonesm -Lf
    UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
    jonesm   23964 16284 23964 92    2 14:34 ?        00:04:11 /util/nwchem/nwchem5.0/bin/
    jonesm   23964 16284 23965 99    2 14:34 ?        00:04:30 /util/nwchem/nwchem5.0/bin/
    jonesm   23987 23986 23987  0    1 14:37 pts/0    00:00:00 bash
    jonesm   24128 23987 24128  0    1 14:39 pts/0    00:00:00 ps -u jonesm -Lf
    [c06n15:~]$ strace -p 23965
    Process 23965 attached - interrupt to quit
    :
    lseek(45, 691535872, SEEK_SET) = 691535872
    read(45, "\0\0\0\0\0\0\0\0\2\273\f[\250\207V\276\376K&]\331\230d"..., 524288) = 524288
    gettimeofday({1161107631, 126604}, {240, 1161107631}) = 0
    gettimeofday({1161107631, 128553}, {240, 1161107631}) = 0
    :
    :
    select(47, [3 4 6 7 8 9 42 43 44 46], [4], NULL, NULL) = 2 (in [4], out [4])
    write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2932) = 2932
    writev(4, [{"\0\0\0\0\0\0\0\17\0\0\0\37\0\0\0\0\0\0\0,\0\0\0\0\0\0\0"..., 32}, {"\1\0\0\0\0\0\0\0\37\0\0\0\17\0\0\0\37\0\0\0,\0\1\0000u"..., 44}], 2) = 76
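
One way to cope with the volume of output is to save it to a file (strace -o trace.log -p 23965, with trace.log a placeholder name) and tally the calls afterwards. The sketch below is only an illustration: syscall_counts is a made-up helper, and it assumes one call per line with no PID prefix, as in the output above. strace's own -c option produces a similar summary directly.

```shell
# syscall_counts FILE: tally system calls by name in saved strace output,
# most frequent first (assumes one call per line, no PID prefixes).
syscall_counts() {
    sed -n 's/^\([a-z_][a-z0-9_]*\)(.*/\1/p' "$1" | sort | uniq -c | sort -rn
}
```

A process that is stuck tends to show one call (select or poll, say) dominating the tally, which is usually easier to spot in the summary than in the raw trace.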

Part II Advanced (Parallel) Debugging

Basic Parallel Debugging

Whither Goest the GUI?

Using a GUI-based debugger gets considerably more difficult when debugging an MPI-based parallel code (not so much on the OpenMP side), because you are now dealing with multiple processes scattered across different machines. The TotalView debugger is the premier product in this arena (it has both CLI and GUI support) - but it is very expensive, and not present in all environments. We will start out using the same toolbox as before, and see that we can accomplish much without spending a fortune. The methodologies will be equally applicable to the fancy commercial products.

Process Checking

First on the agenda - parallel processing involves multiple processes/threads (or both), and the first rule is to make sure that they are ending up where you think that they should be (needless to say, all too often they do not).
- Use MPI_Get_processor_name to report back on where processes are running
- Use ps to monitor processes as they run (useful flags: ps u -L), even on remote nodes (rsh/ssh into them)

Process Checking Example

    [bono:~]$ qstat -n 239365

    bono.ccr.buffalo.edu:
                                                                Req'd  Req'd   Elap
    Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    -------------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    239365.bono.ccr.buff jonesm   ccr      QAtest      27130   4  --     -- 24:00 R 00:50
       c23n31/1+c23n31/0+c23n30/1+c23n30/0+c23n29/1+c23n29/0+c23n28/1+c23n28/0
    [bono:~]$ rsh c23n31 ps -u jonesm -L
      PID   LWP TTY          TIME CMD
    27130 27130 ?        00:00:00 bash
    27201 27201 ?        00:00:00 pbs_demux
    27235 27235 ?        00:00:00 bash
    27244 27244 ?        00:00:00 doqmtests.mpi
    30970 30970 ?        00:00:00 runtests.mpi.un
    30982 30982 ?        00:00:00 mpiexec
    30984 30984 ?        00:00:00 mpiexec
    30985 30985 ?        00:27:40 nwchem-intel-91
    30985 30987 ?        00:02:37 nwchem-intel-91
    30986 30986 ?        00:27:32 nwchem-intel-91
     1616  1616 ?        00:00:00 ps

Or you can script it (I called this script job_ps):

    #!/bin/sh
    #
    # Shell script to take a single argument (PBS job id) and launch a
    # ps command on each node
    #
    QST=`which qstat`
    if [ -z "$QST" ]; then
       echo "ERROR: no qstat in PATH: PATH="$PATH
       exit
    fi
    case $# in
       0) echo "single PBS_JOBID required."; exit ;;   # no args, exit
       1) jobid=$1 ;;
       *) echo "single PBS_JOBID required."; exit ;;   # too many args, exit
    esac
    #
    # get node listing & trim out excess verbiage
    nlines=`$QST -an $jobid | wc -l`
    nlines=`echo "$nlines - 6" | bc`
    nodelist=`$QST -n $jobid | tail -$nlines |
       sed "s/\/[01]+/,/g" | sed "s/\/[01]//" | sed "s/+//g" | sed "s/,/\ /g" |
       awk '{for (i=1; i<=NF; i++) printf("%s\n",$i)}' | uniq |
       awk '{printf "%s ", $1}' |
       awk '{for (i=1; i<=NF-1; i++) printf("%s ",$i); printf("%s",$NF)}'`
    #
    # define ps command
    #MYPS="ps -aeLf | awk '{if (\$5 > 10) print \$1, \$2, \$3, \$4, \$5, \$9, \$10}'"
    MYPS="ps -u jonesm -L -o pid,pcpu,time,comm"
    echo "MYPS = $MYPS"
    for node in $nodelist; do
       echo "NODE = $node, my CPU/thread Usage:"
       rsh $node $MYPS
    done
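
The node-list trimming in job_ps is the hardest part of the script to read. As a rough sketch of the same idea (not the author's code), a node string of the form printed by qstat -n can be reduced to unique node names with tr, sed and awk. The function name unique_nodes is hypothetical, and the input format is assumed to be node/cpu entries joined by + as in the qstat output shown earlier.

```shell
# unique_nodes: read a PBS node string such as
#   c23n31/1+c23n31/0+c23n30/1+c23n30/0
# on standard input and print each node name once, one per line.
unique_nodes() {
    # split entries onto separate lines, strip leading blanks and the
    # "/cpu" suffix, then keep only the first occurrence of each name
    tr '+' '\n' | sed -e 's/^[[:space:]]*//' -e 's|/.*$||' | awk 'NF && !seen[$0]++'
}
```

Inside a running PBS job the same list is usually easier to get from the node file, for example with sort -u $PBS_NODEFILE.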

    [bono:~]$ job_ps 239365
    MYPS = ps -u jonesm -L -o pid,pcpu,time,comm
    NODE = c23n31, my CPU/thread Usage:
      PID %CPU     TIME COMMAND
    27130  0.0 00:00:00 bash
    27201  0.0 00:00:00 pbs_demux
    27235  0.0 00:00:00 bash
    27244  0.0 00:00:00 doqmtests.mpi
     1652  0.0 00:00:00 runtests.mpi.un
     1664  0.0 00:00:00 mpiexec
     1666  0.0 00:00:00 mpiexec
     1667 94.5 00:17:18 nwchem-intel-91
     1667 10.0 00:01:50 nwchem-intel-91
     1668 94.2 00:17:15 nwchem-intel-91
     3813  0.0 00:00:00 ps
    NODE = c23n30, my CPU/thread Usage:
      PID %CPU     TIME COMMAND
     1975 96.2 00:17:36 nwchem-intel-91
     1975  6.6 00:01:13 nwchem-intel-91
     1976 96.0 00:17:34 nwchem-intel-91
     4033  0.0 00:00:00 ps
    NODE = c23n29, my CPU/thread Usage:
      PID %CPU     TIME COMMAND
     2673 95.2 00:17:26 nwchem-intel-91
     2673  8.9 00:01:38 nwchem-intel-91
     2674 94.5 00:17:19 nwchem-intel-91
     4728  1.0 00:00:00 ps
    NODE = c23n28, my CPU/thread Usage:
      PID %CPU     TIME COMMAND
    19284 88.2 00:16:09 nwchem-intel-91
    19284 14.8 00:02:43 nwchem-intel-91
    19285 88.2 00:16:09 nwchem-intel-91
    21374  0.0 00:00:00 ps

GDB in Parallel

Using Serial Debuggers in Parallel?

Yes, you can certainly run debuggers designed for use in sequential codes in parallel. They are even quite effective. You may just have to jump through a few extra hoops to do so ...

Attaching GDB to Running Processes


The simplest way to use a CLI-based debugger in parallel is to attach it to already running processes, namely:
- Find the parallel processes using the ps command (may have to rsh/ssh into remote nodes if that is where they are running)
- Invoke gdb on each process ID:
    [u2:~]$ ssh f10n32
    [f10n32:~]$ ps -u jonesm
      PID TTY          TIME CMD
      680 pts/0    00:00:00 bash
      682 pts/0    00:00:00 pbs_mom
      683 pts/0    00:00:00 pbs_demux
      784 ?        00:00:00 python
      789 pts/0    00:00:00 python
      790 pts/0    00:00:00 sh <defunct>
      791 ?        00:00:00 python
      792 ?        00:00:14 pp.gdb
      797 ?        00:00:00 sshd
    [f10n32:~]$ gdb pp.gdb 792
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
    ....
    0x0000000000400dd5 in pp () at pp.f90:42
    42 do while (gdbWait /= 1)

Of course, unless you put an explicit waiting point inside your code, the processes are probably happily running along when you attach to them, and you will likely want to exert some control over that.

First, using our above example, I was running one MPI task on f10n32 and one on f10n24. After attaching gdb to each process, they paused:
    [f10n32:~/d_hw/d_hw2/d_pp]$ gdb pp.gdb 923
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
    ...
    (gdb) where
    #0  0x00000031ee0dc0e8 in poll () from /lib64/libc.so.6
    #1  0x00002b4015ef9a1b in MPID_nem_tcp_connpoll (in_blocking_poll=1) at ../../socksm.c:2526
    #2  0x00002b4015ef8e82 in MPID_nem_tcp_poll (in_blocking_poll=1) at ../../socksm.c:2324
    #3  0x00002b4015e51a56 in MPID_nem_network_poll (in_blocking_progress=1) at ...
    #4  0x00002b4015cff4ce in MPIDI_CH3I_Progress (progress_state=0x7fff3afe2b90, ...
    #5  0x00002b4015eab822 in PMPI_Recv (buf=0x601e60, count=1, datatype=1275070495, ...
        at ../../recv.c:156
    #6  0x00002b40161ab000 in pmpi_recv__ () from /util/intel/2011.0/impi/4.0.1.007/...
    #7  0x0000000000401288 in pp () at pp.f90:80
    #8  0x000000000040176a in main ()
    #9  0x00000031ee01ecdd in __libc_start_main () from /lib64/libc.so.6
    #10 0x0000000000400c49 in _start ()
    (gdb) c
    Continuing.

and on f10n24:
    [u2:~]$ ssh f10n24
    [f10n24:~]$ ps -u jonesm
      PID TTY          TIME CMD
    22673 ?        00:00:00 python
    22684 ?        00:00:00 python
    22686 ?        00:02:54 pp.gdb
    22693 ?        00:00:00 sshd
    22694 pts/0    00:00:00 bash
    22733 pts/0    00:00:00 ps
    [f10n24:~]$ gdb pp.gdb 22686
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
    ...
    (gdb) where
    #0  0x000000309b2dc0e8 in poll () from /lib64/libc.so.6
    #1  0x00002b333112fa1b in MPID_nem_tcp_connpoll (in_blocking_poll=1) at ../../socksm.c:2526
    #2  0x00002b333112ee82 in MPID_nem_tcp_poll (in_blocking_poll=1) at ../../socksm.c:2324
    #3  0x00002b3331087a56 in MPID_nem_network_poll (in_blocking_progress=1) at ../../...
    #4  0x00002b3330f354ce in MPIDI_CH3I_Progress (progress_state=0x7fffd9dff6b0, ...
    #5  0x00002b33310e1822 in PMPI_Recv (buf=0x601e60, count=4, datatype=1275070495, ...
        at ../../recv.c:156
    #6  0x00002b33313e1000 in pmpi_recv__ () from /util/intel/2011.0/impi/4.0.1.007/...
    #7  0x00000000004013be in pp () at pp.f90:91
    #8  0x000000000040176a in main ()
    #9  0x000000309b21ecdd in __libc_start_main () from /lib64/libc.so.6
    #10 0x0000000000400c49 in _start ()
    (gdb) c
    Continuing.

and we used the (c) continue command to let the execution pick up again where we (temporarily) interrupted it.
Using a Waiting Point

You can insert a waiting point into your code to ensure that execution waits until you get a chance to attach a debugger:
    integer :: gdbWait=0
    :
    :
    CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD, Nprocs, ierr)
    ! dummy pause point for gdb insertion
    do while (gdbWait /= 1)
    end do
Build this without optimization (as you normally would for debugging), since an optimizing compiler may not treat an empty loop like this the way you intend.

and then you will find the code waiting at that point when you attach gdb, and you can release it at your leisure:
    [f10n32:~/d_hw/d_hw2/d_pp]$ gdb pp.gdb 1003
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
    ...
    0x0000000000400dd2 in pp () at pp.f90:42
    42 do while (gdbWait /= 1)
    (gdb) s gdbWait=1
    44 if (Nprocs /= 2) then
    (gdb) c
    Continuing.

    [f10n24:~/d_hw/d_hw2/d_pp]$ gdb pp.gdb 22777
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
    0x0000000000400dd2 in pp () at pp.f90:42
    42 do while (gdbWait /= 1)
    (gdb) s gdbWait=1
    44 if (Nprocs /= 2) then
    (gdb) c
    Continuing.
Here s (step) appears to be given the expression gdbWait=1 as its argument, so evaluating it sets the variable and the step then leaves the loop.

Using GDB Within MPI Task Launcher


Last, but not least, you can usually launch gdb through your MPI task launcher. For example, using the Intel MPI task launcher, mpiexec:
    [k07n14:~/d_hw/d_hw2/d_pp]$ mpiexec -gdb -np 2 ./pp.gdb
    0-1: (gdb) list 30
    0-1: 25
    0-1: 26 integer :: gdbWait=0
    0-1: 27 integer myid, Nprocs, ierr, mpi_procname_length
    0-1: 28 integer :: status(MPI_STATUS_SIZE)
    0-1: 29 character(len=MPI_MAX_PROCESSOR_NAME) :: mpi_procname
    0-1: 30
    0-1: 31 !
    0-1: 32 ! Initialize communicator, check that we are using only 2p
    0-1: 33 !
    0-1: 34 CALL MPI_INIT(ierr)
    0-1: (gdb) b 34
    0-1: Breakpoint 2 at 0x400d1f: file pp.f90, line 34.
    0-1: (gdb) run
    0-1: Continuing.
    0-1:
    0-1: Breakpoint 2, pp () at pp.f90:34
    0-1: 34 CALL MPI_INIT(ierr)
    0-1: (gdb)
    0-1: (gdb) c
    0-1: Continuing.
    0: Hello from proc 0 of 2 k07n14.ccr.buffalo.edu
    1: Hello from proc 1 of 2 k07n14.ccr.buffalo.edu
    0: Number Averaged for Sigmas: 2

Or, in batch mode:

    [u2:~]$ qsub -q debug -lnodes=2:GM:ppn=2,walltime=00:15:00 -I
    qsub: waiting for job 993514.d15n41.ccr.buffalo.edu to start
    qsub: job 993514.d15n41.ccr.buffalo.edu ready

    Job 993514.d15n41.ccr.buffalo.edu has requested 2 cores/processors per node.
    PBSTMPDIR is /scratch/993514.d15n41.ccr.buffalo.edu
    [f10n32:~]$ module load intel-mpi
    [f10n32:~]$ NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
    [f10n32:~]$ mpdboot -n $NNODES -f $PBS_NODEFILE
    [f10n32:~]$ mpicc -g -o cpi.impi cpi.c
    [f10n32:~]$ mpiexec -gdb -np 4 ./cpi.impi
    0-3: (gdb) list 16,20
    0-3: 16 double startwtime = 0.0, endwtime;
    0-3: 17 int namelen;
    0-3: 18 char processor_name[MPI_MAX_PROCESSOR_NAME];
    0-3: 19
    0-3: 20 MPI_Init(&argc,&argv);
    0-3: (gdb) b 20
    0-3: Breakpoint 2 at 0x4009a0: file cpi.c, line 20.
    0-3: (gdb) run
    0-3: Continuing.
    0-3:
    0-3: Breakpoint 2, main (argc=1, argv=0x7fffffffd9b8) at cpi.c:20
    0-3: 20 MPI_Init(&argc,&argv);
    0-3: (gdb)
    0-3: (gdb) c
    0-3: Continuing.
    0: Process 0 on f10n32.ccr.buffalo.edu
    1: Process 1 on f10n32.ccr.buffalo.edu
    2: Process 2 on f10n27.ccr.buffalo.edu
    3: Process 3 on f10n27.ccr.buffalo.edu

Using Serial Debuggers in Parallel

So you can certainly use serial debuggers in parallel - in fact it is a pretty handy thing to do. Just keep in mind:
- Don't forget to compile with debugging turned on
- You can always attach to a running code (and you can instrument the code with that purpose in mind)
- Beware that not all task launchers are equally friendly towards built-in support for serial debuggers

GUI-based Parallel Debugging

TotalView

The TotalView Debugger

The premier parallel debugger, TotalView:
- Sophisticated commercial product (think many $$ ...)
- Designed especially for HPC, multi-process, multi-thread
- Has both GUI and CLI
- Supports C/C++, Fortran 77/90/95, mixtures thereof
- The official debugger of DOE's Advanced Simulation and Computing (ASC) program

Using TotalView at CCR

Pretty simple to start using TotalView on the CCR systems:
1. Generally you want to load the latest version:
       [d16n03:~]$ module avail totalview
2. Make sure that your X DISPLAY environment is working if you are going to use the GUI.
3. The current CCR license supports 2 concurrent users up to 8 processors (precludes usage on nodes with more than 8 cores until/unless this license is upgraded).

DDT

The DDT Debugger

Allinea's commercial parallel debugger, DDT:
- Sophisticated commercial product (think many $$ ...)
- Designed especially for HPC, multi-process, multi-thread
- Has both GUI and CLI
- Supports C/C++, Fortran 77/90/95, mixtures thereof
- CCR has a 32-token license for DDT (including CUDA support)
- To find the latest installed version, module avail ddt

Eclipse PTP

Current Recommendations

CCR has licenses for Allinea's DDT and TotalView (although the current TotalView license is very small and outdated, and will be either upgraded or dropped in favor of DDT). Both are quite expensive, but stay tuned for further developments. Note that the open-source Eclipse project also has a Parallel Tools Platform that can be used in combination with C/C++ and Fortran: http://www.eclipse.org/ptp
