
Crash Dump Analysis

Deadlocks and hangs


Jakub Jermář
Martin Děcký
Crash Dump Analysis MFF UK Deadlocks and hangs 2
Overview

Deadlock

Cycle in the resource waiting chain

Coffman conditions

Various resources: mutexes, rwlocks, condition variables, implicit resources

Hang

No forward progress

Using deadman timer


Deadlock

Configuration in which two or more activities uninterruptibly block waiting for resources held by the others in the blocking chain

Activities can be processes, threads, interrupts

Resources can be synchronization primitives, but also generic resources
Coffman conditions

Necessary conditions for deadlock


(1) Mutual exclusion: a resource can be owned by only one activity at a time
(2) Hold and wait: an activity can request additional resources even if it already owns some
(3) No preemption: a resource cannot be forcibly revoked from an activity
(4) Circular wait: there is a cycle in the activity-resource waiting chain
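
When the waiting relationships are known, the fourth condition can be checked mechanically. The following is a minimal sketch in C, assuming that each blocked activity waits for exactly one other activity (the owner of the resource it is blocked on). The names `waits_for`, `NO_WAIT` and `find_wait_cycle` are hypothetical and not part of any Solaris interface.

```c
#include <assert.h>
#include <stddef.h>

#define NO_WAIT (-1)

/*
 * waits_for[i] is the index of the activity that owns the resource
 * activity i is blocked on, or NO_WAIT if activity i is not blocked.
 * start must be a valid index into the array of n activities.
 *
 * Follows the waiting chain from start. A chain without a cycle visits
 * each activity at most once, so it must end within n steps. If it has
 * not ended after n steps, the current activity lies on a cycle.
 * Returns that activity, or NO_WAIT if the chain ends.
 */
int find_wait_cycle(const int *waits_for, size_t n, int start)
{
    int cur = start;

    if (n == 0)
        return NO_WAIT;

    for (size_t step = 0; step < n; step++) {
        if (cur == NO_WAIT)
            return NO_WAIT;          /* chain ended: no deadlock here */
        cur = waits_for[cur];
    }
    return cur;                      /* NO_WAIT or a member of a cycle */
}
```

With two activities that wait for each other (`waits_for = {1, 0}`) the function reports a member of the cycle; with a chain whose last activity is runnable it returns `NO_WAIT`.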
Deadlock example
P1:
lock(A);
lock(B);
P2:
lock(B);
lock(A);
[Figure: P1, P2 and the locks A and B forming a cycle]
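
The example deadlocks because P1 and P2 take A and B in opposite orders, which closes the cycle. A common remedy is to take locks in one global order. The sketch below uses POSIX mutexes instead of kernel mutexes and orders the locks by address; the helper names `first_in_order`, `lock_pair` and `unlock_pair` are hypothetical, and ordering by address is only one possible convention.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Return whichever of the two mutexes comes first in the global order. */
pthread_mutex_t *first_in_order(pthread_mutex_t *x, pthread_mutex_t *y)
{
    return ((uintptr_t)x < (uintptr_t)y) ? x : y;
}

/*
 * Lock both mutexes in the global order, regardless of the order in
 * which the caller names them. x and y must be two different mutexes.
 */
void lock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
    pthread_mutex_t *first = first_in_order(x, y);
    pthread_mutex_t *second = (first == x) ? y : x;

    pthread_mutex_lock(first);
    pthread_mutex_lock(second);
}

/* The unlock order does not matter for deadlock avoidance. */
void unlock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
    pthread_mutex_unlock(x);
    pthread_mutex_unlock(y);
}
```

If P1 calls `lock_pair(&A, &B)` and P2 calls `lock_pair(&B, &A)`, both take the locks in the same order, so neither can hold one lock while waiting for a lock that the other holds.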
Synchronization primitives

Important to protect against race conditions

Usually figure in deadlocks

In Solaris

Mutexes

Readers-Writer locks

Condition Variables
Mutex

Mutual exclusion for critical sections


mutex_enter(&pidlock);
retval = p->p_pgrp;
mutex_exit(&pidlock);

kmutex_t type in Solaris kernel

mdb dcmd ::mutex


> ffffff02e10!"e0::mutex
           ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
ffffff02e10!"e0 adapt ffffff02d!232420      -      -      no
Readers-Writer locks

Critical sections for multiple readers or one writer

rw_enter(&nvf_list_lock, RW_READER);
rval = nvlist_lookup_nvlist(nvf_list, id, &list);
rw_exit(&nvf_list_lock);
rw_enter(&nvf_list_lock, RW_WRITER);
rval = nvlist_add_uint2(nvf_list, id, value);
rw_exit(&nvf_list_lock);
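
The "multiple readers or one writer" rule can be modelled in user space. The sketch below is a hypothetical model built from a POSIX mutex and condition variable; it is not how the Solaris kernel implements its readers-writer locks, and all `model_` names are invented for illustration. It gives writers no preference, so a steady stream of readers could starve a writer.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space model of a readers-writer lock. */
typedef struct {
    pthread_mutex_t mtx;    /* protects the fields below */
    pthread_cond_t  cv;     /* signalled whenever the lock is released */
    int readers;            /* number of active readers */
    int writer;             /* 1 if a writer holds the lock */
} model_rwlock_t;

#define MODEL_RWLOCK_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

/* Readers may share the lock, but must wait while a writer holds it. */
void model_rw_enter_reader(model_rwlock_t *l)
{
    pthread_mutex_lock(&l->mtx);
    while (l->writer)
        pthread_cond_wait(&l->cv, &l->mtx);
    l->readers++;
    pthread_mutex_unlock(&l->mtx);
}

/* A writer needs exclusive access: no readers and no other writer. */
void model_rw_enter_writer(model_rwlock_t *l)
{
    pthread_mutex_lock(&l->mtx);
    while (l->writer || l->readers > 0)
        pthread_cond_wait(&l->cv, &l->mtx);
    l->writer = 1;
    pthread_mutex_unlock(&l->mtx);
}

/* Release the lock in whichever mode it is held and wake all waiters. */
void model_rw_exit(model_rwlock_t *l)
{
    pthread_mutex_lock(&l->mtx);
    if (l->writer)
        l->writer = 0;
    else
        l->readers--;
    pthread_cond_broadcast(&l->cv);
    pthread_mutex_unlock(&l->mtx);
}
```

Two calls to `model_rw_enter_reader()` succeed back to back, while `model_rw_enter_writer()` would block until both readers have called `model_rw_exit()`.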
Readers-Writer locks (2)

krwlock_t type in Solaris

mdb dcmd ::rwlock


> ffffff00e4ece20::rwlock
           ADDR      OWNER/COUNT FLAGS          WAITERS
ffffff00e4ece20 ffffff00f143=b20  B100
                                   |
                WRITE_LOCKED ------+
Condition variables

Waiting for a condition to become true


mutex_enter(&as->a_contents);
while (AS_ISCLAIMGAP(as))
    cv_wait(&as->a_cv, &as->a_contents);
AS_SETCLAIMGAP(as);
mutex_exit(&as->a_contents);

When the condition becomes true, someone calls cv_signal() or cv_broadcast()

The condition is tested and changed under the protection of a mutex
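
The same wait pattern can be tried out in user space. This is a minimal sketch using POSIX threads rather than the kernel primitives; the `gap_t` type and the `gap_claim()` and `gap_release()` functions are hypothetical stand-ins for the structure and macros used in the kernel snippet.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the structure used in the kernel snippet. */
typedef struct {
    pthread_mutex_t contents;   /* protects claimed */
    pthread_cond_t  cv;         /* signalled when claimed is cleared */
    int claimed;                /* the condition being waited on */
} gap_t;

#define GAP_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Wait until nobody holds the claim, then take it. */
void gap_claim(gap_t *g)
{
    pthread_mutex_lock(&g->contents);
    /* Re-test in a loop: a wakeup does not guarantee the condition holds. */
    while (g->claimed)
        pthread_cond_wait(&g->cv, &g->contents);
    g->claimed = 1;
    pthread_mutex_unlock(&g->contents);
}

/* Drop the claim and wake every waiter so each can re-test the condition. */
void gap_release(gap_t *g)
{
    pthread_mutex_lock(&g->contents);
    g->claimed = 0;
    pthread_cond_broadcast(&g->cv);
    pthread_mutex_unlock(&g->contents);
}
```

A thread that calls `gap_claim()` and never reaches `gap_release()` leaves every later caller asleep on the condition variable, which is the kind of hang that shows up as a thread blocked in a wait routine in a crash dump.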
Condition variables (2)

kcondvar_t type in Solaris

mdb dcmd ::wchaninfo


> ffffff00e2cc"dfa::wchaninfo -v
            ADDR TYPE NWAITERS   THREAD           PROC
ffffff00e2cc"dfa cond        1:  ffffff00e41aa0a0 Xorg
What runs in the system?

Crash dumps taken on a deadlocked or hung system may not exhibit the culprit directly

Need to look further and deeper

::cpuinfo

::threadlist / ::findstack

find arguments on the stack or use WCHAN as shown by ::threadlist
::cpuinfo
> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc3aa0   1b    1   10  -1   no    no t-     ffffff000220!c20 (idle)
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |          10 ffffff00022c!c20
             EXISTS         |           ! ffffff00022bfc20
             ENABLE         |
                            +--> PRI THREAD           PROC
                                  "0 ffffff0002e0c20  sched
::threadlist
> ffffff00022c!c20::threadlist -v
            ADDR             PROC              LWP CLS PRI            WCHAN
ffffff00022c!c20  fffffffffbc24c0                0   0 104  fffffffffbcd"30
  PC: resume_from_intr+0xb3    THREAD: unix`thread_create_intr()
  stack pointer for thread ffffff00022c!c20: ffffff00022c!4a0
  [ ffffff00022c!4a0 resume_from_intr+0xb3() ]
    swtch+0x40()
    turnstile_block+0x=!b()
    mutex_vector_enter+0x2"1()
    clock+0x"3f()
Interpretation of WCHAN
> fffffffffbcd"30::whatis
fffffffffbcd"30 is tod_lock+0 in genunix's bss
> fffffffffbcd"30::mutex
           ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffffffbcd"30 adapt ffffff00e41aa0a0      -      -     yes

We can guess the type from the stack trace too

Need to investigate what the holder is doing


Useful queries

Is someone waiting on e.g. a rwlock?

::threadlist -v | less

/rw_enter

::findlocks

Can detect wait cycles

Needs ::typegraph

nota bene: locks may be held


Deadlock appearance

A deadlocked system will either

Crash because the kernel detects the cycle in the waiting chain

Appear hung and unresponsive

Crash after some time due to the deadman timer, if the lbolt variable does not change

Appear working, if the resources involved in the deadlock are not vital
Dealing with hangs

Goal: force the system to crash so that the culprit can be found in the crash dump

It may be illustrative to explore the hung system using kmdb before forcing the crash dump

Breakpoints and binary search to find the top-level function which loops (if any)

The only option if the hang occurs too early, before a dump can be generated
Binary search on a stack trace
(1) On a hung system, break into kmdb
(2) $C
(3) Pick the return address in the middle of the stack trace
(4) Set a breakpoint on it
(5) :c
(6) If the breakpoint was hit, clear all breakpoints (:z) and repeat the search on the lower half of the stack trace
    If the breakpoint was not hit, clear all breakpoints (:z) and repeat the search on the upper half of the stack trace

It is possible that the stack trace starts with the top-level function; in that case, try to put a breakpoint on a function called from it and see if it gets called
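
The search above can be sketched as an ordinary binary search. The code below is only a simulation of the procedure, not something that drives kmdb: the `probe_fn` callback stands for "set a breakpoint on the return address in this frame, continue, and report whether it was hit", and `find_looping_frame` and `simulated_hit` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical probe: returns nonzero if a breakpoint set on the return
 * address in the given frame was hit after resuming the system.
 */
typedef int (*probe_fn)(size_t frame, void *arg);

/*
 * Frames are numbered as printed by a stack trace: 0 is the innermost
 * frame, nframes - 1 the outermost. Returns the outermost frame whose
 * return address is still reached, which is the candidate for the
 * top-level looping function. Returns nframes if no probe was hit.
 */
size_t find_looping_frame(size_t nframes, probe_fn hit, void *arg)
{
    size_t lo = 0, hi = nframes;      /* search window [lo, hi) */
    size_t found = nframes;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (hit(mid, arg)) {
            found = mid;              /* loop is here or further out */
            lo = mid + 1;             /* continue in the lower half */
        } else {
            hi = mid;                 /* loop is further in: upper half */
        }
    }
    return found;
}

/* Simulated system for illustration: frame *arg is the one that loops. */
int simulated_hit(size_t frame, void *arg)
{
    return frame <= *(size_t *)arg;
}
```

Each probe halves the remaining part of the stack trace, so a trace of a few dozen frames needs only a handful of breakpoint rounds.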
Enforcing crash dump

If you can still use the system shell

halt -d

reboot -d

uadmin 5 1

If kmdb is loaded and you can break into it (F1+A, Stop+A, Ctrl+] se)

$<systemdump
Enforcing crash dump (2)

If you can break to the OBP prompt on SPARC (Stop+A, Ctrl+] se)

sync

Using a button

XIR buttons on server machines

Three times the power button

Deadman timer
Deadman timer

Periodic activity that wakes up every second and monitors the system variable lbolt

Needs to be enabled in /etc/system

set snooping=1

If lbolt doesn't change for a pre-configured amount of time (default is 50 s), a system dump is generated
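
The check described above can be modelled in a few lines. This is a simplified simulation, not the kernel's deadman code: the `deadman_t` structure and `deadman_check()` are hypothetical, and the tick counter is passed in as an argument instead of being read from the kernel.

```c
#include <assert.h>

/* Hypothetical state kept by a deadman-style watchdog. */
typedef struct {
    long last_lbolt;    /* value of the tick counter seen at the last check */
    int  stalled_secs;  /* consecutive checks without any change */
    int  limit_secs;    /* threshold before a dump is forced */
} deadman_t;

/*
 * Called once per second with the current tick counter. Returns 1 when
 * the counter has not advanced for limit_secs consecutive checks, which
 * is the point where a system dump would be forced. Returns 0 otherwise.
 */
int deadman_check(deadman_t *d, long lbolt_now)
{
    if (lbolt_now != d->last_lbolt) {
        /* Forward progress: remember the new value and start over. */
        d->last_lbolt = lbolt_now;
        d->stalled_secs = 0;
        return 0;
    }
    d->stalled_secs++;
    return d->stalled_secs >= d->limit_secs;
}
```

With `limit_secs` set to 50, the function first returns 1 on the fiftieth consecutive check that sees an unchanged counter.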
