This is the one section that will be updated frequently as my experience with RAC grows. Because RAC has been around for a while, most problems can be resolved with a simple Google lookup, but a basic understanding of where to look for the problem is required. In this section I will point you to where to look for problems. Every instance in the cluster has its own alert log, which is where you should start looking. Alert logs contain startup and shutdown information, nodes joining and leaving the cluster, etc.
Here is my complete alert log file of my two node RAC starting up.
The cluster itself has a number of log files that can be examined to gain insight into occurring problems; the table below describes where to find the information you may need on the CRS components.

$ORA_CRS_HOME/crs/log: contains trace files for the CRS resources
$ORA_CRS_HOME/crs/init: contains trace files for the CRS daemon during startup; a good place to start
$ORA_CRS_HOME/css/log: contains cluster reconfigurations, missed check-ins, connects and disconnects; look here to determine when reboots occur
$ORA_CRS_HOME/css/init: contains core dumps from the Cluster Synchronization Service daemon (OCSSd)
$ORA_CRS_HOME/evm/log: log files for the Event Volume Manager (EVM) and evmlogger daemon
$ORA_CRS_HOME/evm/init: pid and lock files for EVM
$ORA_CRS_HOME/srvm/log: log files for the Oracle Cluster Registry (OCR)
$ORA_CRS_HOME/log: log files for Oracle Clusterware, which contain diagnostic messages at the Oracle Clusterware level
As in a normal single-instance Oracle environment, a RAC environment contains the standard RDBMS log files; these files are located by the dump destination parameters (background_dump_dest, user_dump_dest, core_dump_dest). The most important of these are:

$ORACLE_BASE/admin/udump: contains any trace files generated by a user process
$ORACLE_BASE/admin/cdump: contains core files that are generated due to a core dump in a user process
Now let's look at a two-node startup and the sequence of events.

First you must check that the RAC environment is using the correct interconnect; this can be done by any of the following:

log file: the location of my alert log, yours may be different: /u01/app/oracle/admin/racdb/bdump/alert_racdb1.log
oifcfg command: oifcfg getif
table check: select inst_id, pub_ksxpia, picked_ksxpia, ip_ksxpia from x$ksxpia;
oradebug:
SQL> oradebug setmypid
SQL> oradebug ipc
Note: check the trace file, which can be located by the parameter user_dump_dest
cluster_interconnects system parameter: used to specify which address to use
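As a quick sanity check, the output of oifcfg getif can be parsed to confirm which interface is registered as the cluster interconnect. The sketch below is illustrative only: the interface names and subnets in the sample output are made up, and real output depends on your cluster.

```python
# Sketch: classify interfaces from `oifcfg getif` output.
# The sample output below is hypothetical; run the real command on your cluster.

def parse_oifcfg(output: str) -> dict:
    """Map interface name -> (subnet, scope, role) from oifcfg getif lines."""
    interfaces = {}
    for line in output.strip().splitlines():
        name, subnet, scope, role = line.split()
        interfaces[name] = (subnet, scope, role)
    return interfaces

sample = """\
eth0  192.168.2.0  global  public
eth1  10.0.0.0     global  cluster_interconnect
"""

ifs = parse_oifcfg(sample)
# Interfaces flagged cluster_interconnect carry the private RAC traffic
private = [n for n, (_, _, role) in ifs.items() if role == "cluster_interconnect"]
print(private)  # ['eth1']
```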
When the instance starts up, the Lock Monitor's (LMON) job is to register with the Node Monitor (NM) (see the table below). Remember that when a node joins or leaves the cluster the GRD undergoes a reconfiguration event; as seen in the log file it is a seven-step process (see below for more details on the seven-step process).

The LMON trace file also has details about reconfigurations, and it details the reason for the event:
reason 1: the NM initiated the reconfiguration event, typical when a node joins or leaves the cluster
reason 2: an instance has died. How does RAC detect an instance death? Every instance updates the control file with a heartbeat through its checkpoint process (CKPT); if the heartbeat information is missing for x amount of time, the instance is considered dead and the Instance Membership Recovery (IMR) process initiates reconfiguration.
reason 3: communication failure of a node/s. Messages are sent across the interconnect; if they are not received in an amount of time then a communication failure is assumed. By default UDP is used, which can be unreliable, so keep an eye on the logs if too many reconfigurations happen for reason 3.
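The heartbeat check behind reason 2 can be sketched as follows. This is a toy model, not Oracle's implementation: the real mechanism is CKPT writing heartbeats into the control file, and the timeout value and data structures here are purely illustrative.

```python
# Toy model of instance-death detection (reconfiguration reason 2).
# HEARTBEAT_TIMEOUT is illustrative, not Oracle's actual threshold.
HEARTBEAT_TIMEOUT = 30  # seconds

def find_dead_instances(heartbeats: dict, now: float) -> list:
    """Return instances whose last control-file heartbeat is older than the timeout."""
    return [inst for inst, last in heartbeats.items()
            if now - last > HEARTBEAT_TIMEOUT]

# racdb2 last heartbeated 45s ago -> considered dead; IMR would then
# initiate a reconfiguration event to evict it from the GRD
heartbeats = {"racdb1": 100.0, "racdb2": 60.0}
dead = find_dead_instances(heartbeats, now=105.0)
print(dead)  # ['racdb2']
```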
Example of a reconfiguration, taken from the alert log:

Sat Mar 20 11:35:53 2010
Reconfiguration started (old inc 2, new inc 4)
List of nodes:
 0 1
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Mar 20 11:35:53 2010
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Sat Mar 20 11:35:53 2010
LMS 0: 0 GCS shadows traversed, 3291 replayed
Sat Mar 20 11:35:53 2010
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Note: when a reconfiguration happens the GRD is frozen until the reconfiguration is complete.
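Since each reconfiguration is bracketed by "Reconfiguration started" and "Reconfiguration complete" messages like those in the listing above, an alert log can be scanned for them programmatically. A minimal sketch, where the embedded sample text stands in for a real alert log file:

```python
# Sketch: extract reconfiguration events from alert log text, using the
# message format shown in the listing above. Formats may vary by version.
import re

ALERT_TEXT = """\
Sat Mar 20 11:35:53 2010
Reconfiguration started (old inc 2, new inc 4)
Global Resource Directory frozen
Reconfiguration complete
"""

def reconfigurations(text: str):
    """Yield (old_inc, new_inc) for each 'Reconfiguration started' line."""
    pattern = re.compile(r"Reconfiguration started \(old inc (\d+), new inc (\d+)\)")
    for m in pattern.finditer(text):
        yield int(m.group(1)), int(m.group(2))

events = list(reconfigurations(ALERT_TEXT))
print(events)  # [(2, 4)]
```

Frequent events here, especially reason-3 reconfigurations, are worth investigating.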
Confirm that the database has been started in cluster mode; the log file will state the following:

Sat Mar 20 11:36:02 2010
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Completed: ALTER DATABASE MOUNT
Starting with 10g the SCN is broadcast across all nodes; the system will have to wait until all nodes have seen the commit SCN. You can change the broadcast method using the system parameter _lgwr_async_broadcasts.
Lamport Algorithm
The Lamport algorithm generates SCNs in parallel, and they are assigned to transactions on a first come, first served basis; this is different from a single-instance environment. A broadcast method is used after a commit operation; this method is more CPU intensive as it has to broadcast the SCN for every commit, but the other nodes can see the committed SCN immediately.
The initialization parameter max_commit_propagation_delay limits the maximum delay allowed for SCN propagation; by default it is 7 seconds. When set to less than 100, the broadcast-on-commit algorithm is used.
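To make the difference between the two schemes concrete, here is a toy model, not Oracle's implementation: Lamport-style nodes advance their own SCN and only merge when a message happens to arrive, while broadcast-on-commit pushes each commit SCN to every node immediately.

```python
# Toy model of the two SCN propagation schemes; illustrative only.

class LamportNode:
    """A node that advances its own SCN, syncing only when messages arrive."""
    def __init__(self):
        self.scn = 0

    def commit(self):
        self.scn += 1
        return self.scn

    def receive(self, remote_scn):
        # Lamport merge: adopt the highest SCN seen so far
        self.scn = max(self.scn, remote_scn)

def broadcast_commit(nodes, committer):
    """Broadcast-on-commit: every node sees the new SCN immediately."""
    scn = committer.commit()
    for n in nodes:
        n.receive(scn)
    return scn

a, b = LamportNode(), LamportNode()
a.commit(); a.commit()       # node A commits twice; B has not heard about it
print(b.scn)                 # 0 -> with pure Lamport, B briefly lags behind A
broadcast_commit([a, b], a)  # broadcast: B catches up at commit time
print(b.scn)                 # 3
```

The extra receive() per node per commit is the CPU cost the text mentions; the payoff is that no node ever reads a stale commit SCN.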
Disable/Enable Oracle RAC

There are times when you may wish to disable RAC; this feature can only be used in a Unix environment (there is no Windows option).

Disable Oracle RAC (Unix only)
• Log in as the oracle user on all nodes
• Shut down all instances using either the normal or immediate option
• Change to the working directory $ORACLE_HOME/lib
• Run the below make command to relink the Oracle binaries without the RAC option (should take a few minutes)
• Hung Database
• Hung Session(s)
• Query Performance
A hung database is basically an internal deadlock between two processes. Usually Oracle will detect the deadlock and roll back one of the processes; however, if the situation occurs with internal kernel-level resources (latches or pins), Oracle is unable to automatically detect and resolve the deadlock, thus hanging the database. When this event occurs you must obtain dumps from each of the instances (3 dumps per instance at regular intervals); the trace files will be very large.
capture information
## Using alter session
SQL> alter session set max_dump_file_size = unlimited;
SQL> alter session set events 'immediate trace name systemstate level 10';

## Using oradebug
SQL> select * from dual;
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 10