This site has been compiled from reading the following books and from real-world experience. If you are new to Oracle RAC I highly recommend purchasing these books, as they contain far more information than this web site does, and of course the official Oracle web site contains all the documentation you will ever need.
Please feel free to email me any constructive criticism you have about the site; any additional knowledge, or corrections to mistakes I have made, would be most welcome.
When you have very critical systems that are required to be online 24x7, you need an HA (High Availability) solution. You have to weigh the risk associated with downtime against the cost of a solution: HA solutions are not cheap and they are not easy to manage. HA solutions need to be thoroughly tested, as they may not be put to the test in the real world for months. I had a solution that ran for almost a year before a hardware failure caused a failover; this is when your testing beforehand comes into play.
As I said before, HA comes with a price, and there are a number of HA technologies:
• Fault Tolerance - protects you from hardware failures, for example redundant PSUs, etc.
• Disaster Recovery - protects you from operational issues such as a data center becoming unavailable
• Disaster Tolerance - used to prepare for the above two; the most important of the three technologies
Every company should plan for unplanned outages; this costs virtually nothing, and knowing what to do in a DR situation is half the battle. In many companies people make excuses not to design a DR plan (it costs too much, we don't have the redundant hardware, etc.). You cannot make these assumptions until you design a DR plan: the plan will highlight the risks and the costs that go with each risk, and then you can make the decision on what you can and cannot afford. There is no excuse not to create a DR plan.
Sometimes in large corporations you will hear the phrase "five nines". This phrase refers to the availability of a system and the (approximate) downtime allowed; the table below highlights the uptime a system requires in order to achieve the five nines:

• 99% availability - about 3.65 days of downtime per year
• 99.9% availability - about 8.76 hours of downtime per year
• 99.99% availability - about 52.6 minutes of downtime per year
• 99.999% availability - about 5.26 minutes of downtime per year

To achieve the five nines your system is only allowed about 5.26 minutes of downtime per year, or roughly 6 seconds per week; in some HA designs it may take 6 seconds just to fail over.
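The downtime figures behind these availability levels are simple arithmetic; the sketch below reproduces them:

```python
# Downtime allowed for a given availability percentage.
# Pure arithmetic, matching the "five nines" figures quoted above.

MINUTES_PER_YEAR = 365 * 24 * 60
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

def downtime_per_year_minutes(availability_percent):
    """Minutes of downtime allowed per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

def downtime_per_week_seconds(availability_percent):
    """Seconds of downtime allowed per week at the given availability."""
    return SECONDS_PER_WEEK * (1 - availability_percent / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}%: {downtime_per_year_minutes(nines):.2f} min/year, "
          f"{downtime_per_week_seconds(nines):.2f} sec/week")
```

At 99.999% this works out to roughly 5.26 minutes per year and 6 seconds per week, the figures used above.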
When looking for a solution you should try to build redundancy into your plan; this is the first step to an HA solution. For example:
• Make sure computer cabinets have dual power feeds
• Make sure servers have dual power supplies, dual network cards and redundant hard disks that can be mirrored
• Make sure you use multipathing to the data disks, which are usually on a SAN or NAS
• Make sure that each server is connected to two different network switches
You are trying to eliminate as many Single Points Of Failure (SPOFs) as you can without increasing the costs. Most hardware today will have these redundancy features built in, but it's up to you to make use of them.
No failover
This option usually uses the already built-in redundancy: failed disks and PSUs can be replaced online, but if a major hardware component were to fail then a system outage is unavoidable, and the system will remain down until it is fixed.
This solution can be perfectly acceptable in some environments, but at what price to the business? Even in today's market, QA/DEV systems cost money when not running; I am sure that your developers are quite happy to take the day off paid while you fix the system.

Cluster
This is the jewel in the HA world. A cluster can be configured in a variety of flavors, from minimal downtime while services are moved to a good node, to virtually zero downtime.
However, a cluster solution does come with a heavy price tag: the hardware, configuration and maintenance of a cluster are expensive. But if your business loses vast amounts of money when your system is down, then it's worth it.

Cold failover
Many smaller companies use this solution: basically you have an additional server ready to take over from a number of servers if one were to fail. I have used this technique myself; I created a number of scripts that can turn a cold standby server into any one of a number of servers if the original server is going to have a prolonged outage.
The problem with this solution is that there is going to be downtime, especially if it takes a long time to get the standby server up to the same point in time as the failed server.
The advantage of this solution is that one additional server can cover a number of servers, even if it is slightly underpowered compared to the original server, as long as it keeps the service running.

Hot failover
Many applications offer hot-standby servers; these servers run alongside the live system, and data is applied to the hot-standby server periodically to keep it up to date, so in a failover situation the server is almost ready to go.
The problem with this solution is cost and manageability; also, one server is usually dedicated to one application, so you may have to have many hot-standby servers.
The advantage is that downtime is kept to a minimum, but there will be some downtime, generally the time it takes to bring the hot-standby server up to date, for example applying the last set of logs to a database.
Here is a summary table that shows the most common aspects of cold failover versus hot failover.
Clustering
I have discussed clustering in my Tomcat and JBoss topics, so I will only touch on the subject lightly here. A cluster is a group of two or more interconnected nodes that provide a service. The cluster provides a high level of fault tolerance: if a node becomes unavailable within the cluster, its services are moved/restored to another working node, so the end user should never know that a fault occurred.
Clusters can be set up to use a single node at a time or to load-balance between the nodes, but the main objective is to keep the service running, hence why you pay top dollar for this. One advantage of a cluster is that it is very scalable, because additional nodes can be added or taken away (a node may need to be patched) without interrupting the service.
Clustering has come a long way; there are now three types of clustering architecture:

Shared nothing
Each node within the cluster is independent; they share nothing. An example of this might be web servers: you have a number of nodes within the cluster supplying the same web service. The content is static, so there is no need to share disks, etc.

Shared disk only
Each node is attached, or has access, to the same set of disks. These disks contain the data that is required by the service. One node controls the application and the disk, and in the event that that node fails, the other node takes control of both the application and the data. This means that one node has to be on standby, sitting idle, waiting to take over if required to do so. A typical traditional Veritas Cluster or Sun Cluster would fit the bill here.

Shared everything
Again, all nodes are attached, or have access, to the same set of disks, but this time each node can read/write to the disks concurrently. Normally there is a piece of software that controls the reading and writing to the disks, ensuring data integrity. To achieve this a cluster-wide filesystem is introduced, so that all nodes view the filesystem identically; the software then coordinates the sharing and updating of files, records and databases. Oracle RAC and IBM HACMP would be good examples of this type of cluster.
The first Oracle cluster database was released with Oracle 6 for the Digital VAX; it was the first cluster database on the market. With Oracle 6.2, Oracle Parallel Server (OPS) was born, which used Oracle's own DLM (Distributed Lock Manager). Oracle 7 used vendor-supplied clusterware, but this was complex to set up and manage; Oracle 8 introduced a general lock manager, a step in the direction of Oracle creating its own clusterware product. Oracle's lock manager is integrated with the Oracle code via an additional layer called OSD (Operating System Dependent); this was soon integrated within the kernel and became known as IDLM (Integrated Distributed Lock Manager) in later Oracle versions. Oracle Real Application Clusters 9i (Oracle RAC) used the same IDLM and relied on external clusterware software (Sun Cluster, Veritas Cluster, etc.).
An Oracle parallel database consists of two or more nodes that own Oracle instances and share a disk array. Each node has its own SGA and its own redo logs, but the data files and control files are shared by all instances. All data and control files are concurrently read and written by all instances; redo log files, on the other hand, can be read by any instance but written only by the owning instance. Each instance has its own set of background processes.
The components of an OPS database are described below.
The Cluster Group Services (CGS) layer has some OSD components (the node monitor interface) and the rest is built into the kernel. CGS has a key repository used by the DLM for communication and network-related activities. This layer provides the following:
• Internode messaging
• Group member consistency
• Cluster synchronization
• Process grouping, registration and deregistration
The DLM is an integral part of OPS and of the RAC stack. In older versions the DLM API module had to rely on external OS routines to check the status of a lock; this was done using UNIX sockets and pipes. With the new IDLM the data is in the SGA of each instance and requires only a serialized lookup using latches and/or enqueues, and may require global coordination, the algorithm for which was built directly into the Oracle kernel. The IDLM's job is to track every lock granted to a resource; the memory structures required by the DLM are allocated out of the shared pool. The design of the DLM is such that it can survive the failure of all but one node of the cluster.
A user must acquire a lock before it can operate on any resource. Parallel Cache Management (PCM) coordinates and maintains the data blocks that exist within each instance's data buffer cache so that data viewed or requested by users is never inconsistent or incoherent. The PCM ensures that only one instance in a cluster can modify a block at any given time; other instances have to wait until the lock is released.
The DLM maintains information about all locks on a given resource. The DLM nominates one node to manage all relevant lock information for a resource; this node is referred to as the master node, and lock mastering is distributed among all nodes. Using the IPC layer, the DLM shares the load of mastering resources, which means that a user can lock a resource on one node but actually end up communicating with the processes on another node.
In OPS 8i Oracle introduced Cache Fusion Stage 1, which brought a new background process called the Block Server Process (BSP). The BSP's main role was to ship consistent read (CR) versions of blocks between instances in a read/write contention scenario; this shipping is performed over a high-speed interconnect. Cache Fusion Stage 2, in Oracle 9i and 10g, addresses some of the issues with Stage 1, in that both types of blocks (CR and CUR) can be transferred using the interconnect. Since 8i, the introduction of the GV$ views has meant that a DBA can view cluster-wide database and other statistics from any node/instance of the cluster.
Oracle RAC addresses the limitations of OPS by extending Cache Fusion and by dynamic lock mastering. Oracle 10g RAC also comes with its own integrated clusterware and storage management framework, removing all dependencies on a third-party clusterware product.
2. RAC Architecture
RAC Architecture Introduction
Oracle Real Application Clusters allows multiple instances to access a single database; the instances run on multiple nodes. In a standard Oracle configuration a database can only be mounted by one instance, but in a RAC environment many instances can access a single database.
Oracle RAC is heavily dependent on an efficient, highly reliable, high-speed private network called the interconnect; make sure when designing a RAC system that you get the best you can afford.
The table below describes the differences between a standard Oracle database (single instance) and a RAC environment:

• SGA
  Single instance: the instance has its own SGA.
  RAC: each instance has its own SGA.
• Background processes
  Single instance: the instance has its own set of background processes.
  RAC: each instance has its own set of background processes.
• Datafiles
  Single instance: accessed by only one instance.
  RAC: shared by all instances (shared storage).
• Control files
  Single instance: accessed by only one instance.
  RAC: shared by all instances (shared storage).
• Online redo logfiles
  Single instance: dedicated for read/write by only one instance.
  RAC: only one instance can write, but other instances can read during recovery and archiving. If an instance is shut down, log switches by other instances can force the idle instance's redo logs to be archived.
• Archived redo logfiles
  Single instance: dedicated to the instance.
  RAC: private to the instance, but other instances will need access to all required archive logs during media recovery.
• Flash Recovery Area
  Single instance: accessed by only one instance.
  RAC: shared by all instances (shared storage).
• Alert log and trace files
  Single instance: dedicated to the instance.
  RAC: private to each instance; other instances never read or write to those files.
• ORACLE_HOME
  Single instance: multiple instances on the same server accessing different databases can use the same executable files.
  RAC: same as single instance, plus it can be placed on a shared file system, allowing a common ORACLE_HOME for all instances in a RAC environment.
RAC Components
The diagram below describes the basic architecture of the Oracle RAC environment.
Here is the list of processes running on a freshly installed RAC.
Disk architecture
With today's SAN and NAS disk storage systems, sharing storage is fairly easy, and it is required for a RAC environment. You can use the storage setups below:
• SAN (Storage Area Network) - generally using fibre to connect to the SAN
• NAS (Network Attached Storage) - generally using a network to connect to the NAS using either NFS or iSCSI
• JBOD - direct attached storage, the old traditional way, still used by many companies as a cheap option
All of the above solutions can offer multipathing to reduce SPOFs within the RAC environment, and there is no reason not to configure it: adding additional paths to the disk is cheap because most of the expense is paid out when configuring the first path, so an additional controller card and network/fibre cables are all that is needed.
The last thing to think about is how to set up the underlying disk structure; this is known as a RAID level. There are about 12 different RAID levels that I know of; here are the most common ones:
raid 0 (Striping)
Data is striped across two or more disks; there is no redundancy.
Advantages: improved performance; can create very large volumes.
Disadvantages: not highly available (if one disk fails, the volume fails).

raid 1 (Mirroring)
A single disk is mirrored by another disk; if one disk fails the system is unaffected, as it can use its mirror.
Advantages: improved performance; highly available (if one disk fails the mirror takes over).
Disadvantages: expensive (requires double the number of disks).

raid 5 (Striping with parity)
RAID stands for Redundant Array of Inexpensive Disks. The disks are striped with parity across 3 or more disks; the parity is used in the event that one of the disks fails, and the data on the failed disk is reconstructed using the parity.
Advantages: improved performance (reads only); not expensive.
Disadvantages: slow write operations (caused by having to calculate the parity).
There are many other RAID levels that can be used with particular hardware environments, for example EMC storage uses RAID-S and HP storage uses AutoRAID, so check with the manufacturer for the solution that will give you the best performance and resilience.
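As a rough illustration of the capacity trade-offs between these levels, here is a small sketch (plain arithmetic, not tied to any particular storage product):

```python
# Usable capacity for the common RAID levels described above.
# disks: number of drives; size: capacity of each drive (same unit in the result).

def usable_capacity(level, disks, size):
    if level == 0:        # striping: all raw capacity usable, no redundancy
        return disks * size
    if level == 1:        # mirroring: half the raw capacity
        assert disks % 2 == 0, "mirroring needs disks in pairs"
        return disks // 2 * size
    if level == 5:        # striping with parity: one disk's worth lost to parity
        assert disks >= 3, "RAID 5 needs at least 3 disks"
        return (disks - 1) * size
    raise ValueError("unsupported RAID level")

# Four 250GB disks:
# RAID 0 -> 1000GB, RAID 1 -> 500GB, RAID 5 -> 750GB
```

This also shows why RAID 1 is described as expensive above: for the same usable capacity it needs twice the number of disks.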
Once you have your storage attached to the servers, you have three choices on how to set up the disks:
• Raw volumes - normally used for performance benefits; however they are hard to manage and back up
• Cluster filesystem - used to hold all the Oracle datafiles; can be used by Windows and Linux, but it is not widely used
• Automatic Storage Management (ASM) - Oracle's choice of storage management; a portable, dedicated and optimized cluster filesystem
I will only be discussing ASM, on which I already have a topic called Automatic Storage Management.
Oracle Clusterware
The Oracle Clusterware software is designed to run Oracle in cluster mode; it can support up to 64 nodes and can even be used with a vendor cluster like Sun Cluster.
The Clusterware software allows nodes to communicate with each other and forms the cluster that makes the nodes work as a single logical server. The software is run by the Cluster Ready Services (CRS) using the Oracle Cluster Registry (OCR), which records and maintains the cluster and node membership information, and the voting disk, which acts as a tiebreaker during communication failures. Consistent heartbeat information travels across the interconnect to the voting disk when the cluster is running.
The CRSd process manages resources, such as starting and stopping services and failing over application resources; it also spawns separate processes to manage application resources. CRS manages the OCR, which stores the current known state of the cluster, and it requires a public, a private and a VIP interface in order to run. OCSSd provides synchronization services among nodes; it gives access to the node membership and enables basic cluster services, including cluster group services and locking. Failure of this daemon causes the node to be rebooted to avoid split-brain situations.
The last component is the Event Management Logger, which runs the EVMd process. The daemon spawns a process called evmlogger and generates events when things happen. The evmlogger spawns new child processes on demand and scans the callout directory to invoke callouts. Death of the EVMd daemon will not halt the instance, and the daemon will be restarted.
Quick recap
• OPROCd (Process Monitor)
  Functionality: provides basic cluster integrity services.
  Failure of the process: node restart.
  Runs as: root.
• EVMd (Event Management)
  Functionality: spawns a child process event logger and generates callouts.
  Failure of the process: daemon automatically restarted, no node restart.
  Runs as: oracle.
• OCSSd (Cluster Synchronization Services)
  Functionality: basic node membership, group services, basic locking.
  Failure of the process: node restart.
  Runs as: oracle.
• CRSd (Cluster Ready Services)
  Functionality: resource monitoring, failover and node recovery.
  Failure of the process: daemon restarted automatically, no node restart.
  Runs as: root.
The Cluster Ready Services (CRS) component is new in 10g RAC; it is installed in a separate home directory called ORACLE_CRS_HOME. It is a mandatory component, but it can be used with a third-party cluster (Veritas, Sun Cluster); by default it manages the node membership functionality along with managing regular RAC-related resources and services.
RAC uses a membership scheme, so any node wanting to join the cluster has to become a member. RAC can evict any member that it sees as a problem; its primary concern is protecting the data. You can add and remove nodes from the cluster and the membership increases or decreases accordingly. When network problems occur, membership becomes the deciding factor on which part stays as the cluster and which nodes get evicted; a voting disk is used for this, which I will talk about later.
The resource management framework manages the resources of the cluster (disks, volumes), and you can have only one resource management framework per resource. Multiple frameworks are not supported, as this can lead to undesirable effects.
The Oracle Cluster Ready Services (CRS) uses a registry to keep the cluster configuration; it should reside on shared storage accessible to all nodes within the cluster. This shared storage is known as the Oracle Cluster Registry (OCR) and it is a major part of the cluster; it is automatically backed up (every 4 hours) by the daemons, and you can also back it up manually. The OCSSd uses the OCR extensively and writes the changes to the registry.
The OCR keeps details of all resources and services; it stores name and value pairs of information, such as resources, that are used to manage the resource equivalents by the CRS stack. Resources within the CRS stack are components that are managed by CRS and carry information on their good/bad state and their callout scripts. The OCR is also used to supply bootstrap information such as ports and nodes; it is a binary file.
The OCR is loaded as a cache on each node. Each node will update the cache, but only one node is allowed to write the cache to the OCR file; that node is called the master. Enterprise Manager also uses the OCR cache. The OCR should be at least 100MB in size. The CRS daemon will update the OCR with the status of the nodes in the cluster during reconfigurations and failures.
The voting disk (or quorum disk) is shared by all nodes within the cluster; information about the cluster is constantly being written to the disk, and this is known as the heartbeat. If for any reason a node cannot access the voting disk, it is immediately evicted from the cluster. This protects the cluster from split-brains (the Instance Membership Recovery (IMR) algorithm is used to detect and resolve split-brains), as the voting disk decides which part is the real cluster. The voting disk manages the cluster membership and arbitrates cluster ownership during communication failures between nodes. Voting is often confused with quorum; they are similar but distinct, and the definitions below detail what each means:
Voting: a vote is usually a formal expression of opinion or will in response to a proposed decision.
Quorum: defined as the number, usually a majority of members of a body, that, when assembled, is legally competent to transact business.
The only vote that counts is the quorum member vote; the quorum member vote defines the cluster. If a node or group of nodes cannot achieve a quorum, they should not start any services, because they risk conflicting with an established quorum.
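The quorum rule, that a partition may only continue if it holds a strict majority of the votes, can be sketched as:

```python
# A partition of the cluster may only continue as "the" cluster if it
# holds a strict majority of all configured votes; otherwise it must
# not start services, to avoid conflicting with an established quorum.

def has_quorum(votes_held, total_votes):
    """True if this partition holds more than half of all votes."""
    return votes_held > total_votes / 2

# In a 5-node cluster split 3/2, only the 3-node side keeps quorum;
# in a 4-node cluster split 2/2, neither side does, which is why an
# odd number of votes (e.g. voting disks) avoids unresolvable ties.
```

This is also why eviction on losing access to the voting disk is safe: the evicted node can never outvote the surviving majority.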
The voting disk has to reside on shared storage; it is a small file (20MB) that can be accessed by all nodes in the cluster. In Oracle 10g R1 you can have only one voting disk, but in R2 you can have up to 32 voting disks, allowing you to eliminate any SPOFs.
The original virtual IP mechanism in Oracle was Transparent Application Failover (TAF), but this had limitations and has now been replaced with cluster VIPs. The cluster VIPs will fail over to working nodes if a node should fail; these public IPs are configured in DNS so that users can access them. The cluster VIPs are different from the cluster interconnect IP addresses and are only used to access the database.
The cluster interconnect is used to synchronize the resources of the RAC cluster, and is also used to transfer some data from one instance to another. This interconnect should be private, highly available and fast with low latency; ideally it should be on a minimum private 1GB network. Whatever hardware you are using, the NICs should use multipathing (Linux - bonding, Solaris - IPMP). You can use crossover cables in a QA/DEV environment, but they are not supported in a production environment; crossover cables also limit you to a two-node cluster.
The kernel components relate to the background processes, the buffer cache and the shared pool, and managing these resources without conflicts and corruptions requires special handling. In RAC, as more than one instance accesses a resource, the instances require better coordination at the resource management level. Each node has its own set of buffers, but is able to request and receive data blocks currently held in another instance's cache. The management of data sharing and exchange is done by the Global Cache Services (GCS).
All the resources in the cluster group form a central repository called the Global Resource Directory (GRD), which is distributed. Each instance masters some set of resources, and together all instances form the GRD. The resources are equally distributed among the nodes based on their weight. The GRD is managed by two services, called Global Cache Services (GCS) and Global Enqueue Services (GES); together they form and manage the GRD. When a node leaves the cluster, the GRD portion of that instance needs to be redistributed to the surviving nodes; a similar action is performed when a new node joins.
RAC Background Processes
Each node has its own background processes and memory structures; there are additional processes beyond the norm to manage the shared resources, and these additional processes maintain cache coherency across the nodes.
1. When instance A needs a block of data to modify, it reads the block from disk; but before reading it must inform the GCS (DLM). The GCS keeps track of the lock status of the data block by keeping an exclusive lock on it on behalf of instance A.
2. Now instance B wants to modify that same data block; it too must inform the GCS. The GCS will then request instance A to release the lock; thus the GCS ensures that instance B gets the latest version of the data block (including instance A's modifications) and then exclusively locks it on instance B's behalf.
3. At any one point in time, only one instance has the current copy of the block, thus keeping the integrity of the block.
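The three steps above can be sketched as a toy model. All class and method names here are illustrative inventions, not real Oracle structures; the point is only the protocol: a single tracked exclusive lock per block, with the current copy shipped to the next requester instead of being re-read from disk.

```python
# Toy model of the GCS exclusive-lock handoff described in steps 1-3.

class GCS:
    def __init__(self):
        self.lock_owner = {}   # block id -> instance currently holding the lock
        self.current = {}      # block id -> latest version of the block

    def request_exclusive(self, instance, block, disk_copy=None):
        owner = self.lock_owner.get(block)
        if owner is None:
            # Step 1: first requester reads from disk; GCS records the lock.
            self.current[block] = disk_copy
        elif owner is not instance:
            # Step 2: conflicting request; previous owner releases the lock
            # and GCS ships its (possibly modified) current copy across.
            owner.release(block)
        self.lock_owner[block] = instance
        return self.current[block]

    def publish(self, instance, block, data):
        # Step 3: only the single lock holder may update the current copy.
        assert self.lock_owner[block] is instance
        self.current[block] = data

class Instance:
    def __init__(self, name, gcs):
        self.name, self.gcs, self.cache = name, gcs, {}

    def modify(self, block, new_value, disk_copy=None):
        self.cache[block] = self.gcs.request_exclusive(self, block, disk_copy)
        self.cache[block] = new_value
        self.gcs.publish(self, block, new_value)

    def release(self, block):
        self.cache.pop(block, None)

gcs = GCS()
a, b = Instance("A", gcs), Instance("B", gcs)
a.modify("blk1", "A's change", disk_copy="on-disk value")
b.modify("blk1", "B's change")   # B receives A's change, not the stale disk copy
```

Note how instance B never sees the stale on-disk value: the latest copy travels with the lock, which is exactly the behaviour Cache Fusion provides over the interconnect.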
The GCS maintains data coherency and coordination by keeping track of the lock status of each block that can be read/written by any node in the RAC. The GCS is an in-memory database that contains information about current locks on blocks and about instances waiting to acquire locks. This is known as Parallel Cache Management (PCM). The Global Resource Manager (GRM) helps to coordinate and communicate the lock requests from Oracle processes between instances in the RAC. Each instance has a buffer cache in its SGA; to ensure that each RAC instance obtains the block that it needs to satisfy a query or transaction, RAC uses two services, the GCS and the GES, which maintain records of the lock status of each data file and each cached block using the GRD.
A global resource is a resource that is visible to all the nodes within the cluster. Data buffer cache blocks are the most obvious and most heavily used global resource; transaction enqueues and database data structures are other examples. The GCS handles data buffer cache blocks, and the GES handles all the non-data-block resources.
All caches in the SGA are either global or local: the dictionary and buffer caches are global, while the large and Java pool buffer caches are local. Cache Fusion is used to read a data buffer cache block from another instance instead of getting the block from disk; thus Cache Fusion moves current copies of data blocks between instances (hence why you need a fast private network), and the GCS manages the block transfers between the instances.
• LMSn (Lock Manager Server process - GCS)
  These processes ship the data blocks between the instances for Cache Fusion. You can see the statistics of this daemon by looking at the view X$KJMSDP.
• LMON (Lock Monitor Process - GES)
  This process manages the GES and maintains consistency of the GCS memory structures in case of process death. It is also responsible for cluster reconfiguration and lock reconfiguration (a node joining or leaving); it checks for instance deaths and listens for local messaging. A detailed log file is created that tracks any reconfigurations that have happened.
• LMD (Lock Manager Daemon - GES)
  This process manages the enqueue manager service requests for the GCS. It also handles deadlock detection and remote resource requests from other instances. You can see the statistics of this daemon by looking at the view X$KJMDDP.
• LCK0 (Lock Process - GES)
  Manages instance resource requests and cross-instance call operations for shared resources. It builds a list of invalid lock elements and validates lock elements during recovery.
• DIAG (Diagnostic Daemon)
  This is a lightweight process; it uses the DIAG framework to monitor the health of the cluster. It captures information for later diagnosis in the event of failures, and it will perform any necessary recovery if an operational hang is detected.
3. RAC Installation, Configuration and Storage
RAC Installation
I am not going to show you a step-by-step guide on how to install Oracle RAC; there are many documents on the internet that explain it better than I could. However, I will point to the one I am fond of, and it works very well if you want to build a cheap Oracle RAC environment to play around with; the instructions are simple and I have had no problems setting it up, installing and configuring it.
To configure an Oracle RAC environment, follow the instructions in the document Build your own Oracle RAC cluster on Oracle Enterprise Linux and iSCSI; there is also a newer version out using 11g. As I said, the document is excellent. I used the hardware below and it cost me a little over £400 from eBay, a lot cheaper than an Oracle course.
I did try to set up a RAC environment on VMware on my laptop (I do have an old laptop) but it did not work very well, hence why I took the route above.
Hardware:

• Instance nodes 1, 2 and 3: 3 x Compaq Evo D510 PCs
  Specs: CPU - 2.4GHz (P4), RAM - 2GB, HD - 40GB
  Note: picked these up for £50 each; had to buy additional memory to max them out. The third node I use to add, remove and break to see what happens to the cluster; definitely worth getting a third node.
• Openfiler server: Compaq Evo D510 PC
  Specs: CPU - 2.4GHz (P4), RAM - 2GB, HD - 40GB, plus HD - 250GB (bought an additional disk for iSCSI storage, more than enough for me)
• Router/Switch: 2 x Netgear GS608 8-port Gigabit switches (one for the private RAC network, one for the iSCSI network (data))
  Note: I could have connected it all to one switch and saved a bit of money.
• Miscellaneous: 1GB network cards - support jumbo frames (may or may not be required any more) and TOE (TCP offload engine); network cables - cat5e; KVM switch - a cheap one
Make sure you give yourself a couple of days to set up, install and configure the RAC; take your time and make notes. I have now set it up and reinstalled so many times that I can do it in a day.
Make use of that third node: don't install it with the original configuration, add it afterwards. Use this node to practise removing a node from the cluster and also to simulate node failures; this is the only way to learn. Keep repeating certain situations until you fully understand how RAC works.
Good luck!
4. RAC Administration and Management
RAC Parameters
I am only going to talk about RAC administration; if you need general Oracle administration then see my Oracle section.
It is recommended that the spfile (binary parameter file) is shared between all nodes within the cluster, but it is possible for each instance to have its own spfile. The parameters can be grouped into three categories.
The main unique parameters that you should know about are:
• instance_name - defines the name of the Oracle instance (the default is the value of the ORACLE_SID variable)
• instance_number - a unique number for each instance; must be greater than 0 but smaller than the max_instances parameter
• thread - specifies the set of redo log files to be used by the instance
• undo_tablespace - specifies the name of the undo tablespace to be used by the instance
• rollback_segments - you should use Automatic Undo Management instead
• cluster_interconnects - use only if Oracle has trouble picking the correct interconnect
The identical parameters that you should know about are below:
• cluster_database - options are true or false; mounts the control file in either shared (cluster) or exclusive mode. Use false in the cases below:
  o Converting from noarchivelog mode to archivelog mode and vice versa
  o Enabling the flashback database feature
  o Performing a media recovery on the system tablespace
  o Maintenance of a node
• active_instance_count - used for primary/secondary RAC environments
• cluster_database_instances - specifies the number of instances that will be accessing the database (set to the maximum number of nodes)
• dml_locks - specifies the number of DML locks for a particular instance (only change if you get ORA-00055 errors)
• gc_files_to_locks - specifies the number of global locks for a data file; changing this disables Cache Fusion
• max_commit_propagation_delay - influences the mechanism Oracle uses to synchronize the SCN among all instances
• instance_groups - specifies multiple parallel query execution groups and assigns the current instance to those groups
• parallel_instance_group - specifies the group of instances to be used for parallel query execution
• gcs_server_processes - specifies the number of lock manager server (LMS) background processes used by the instance for Cache Fusion
• remote_listener - registers the instance with listeners on remote nodes
syntax for parameter file:
<instance_name>.<parameter_name>=<parameter_value>
inst1.db_cache_size = 1000000
*.undo_management=auto
example:
alter system set db_2k_cache_size=10m scope=spfile sid='inst1';
Note: use the sid option to specify a particular instance
The srvctl command is used to start/stop an instance; you can also use sqlplus to start
and stop an instance.
start all instances:
srvctl start database -d <database> -o <option>
Note: starts the listeners if not already running; the -o option specifies the startup
option: force, open, mount, nomount
stop all instances:
srvctl stop database -d <database> -o <option>
Note: the listeners are not stopped; the -o option specifies the shutdown option:
immediate, abort, normal, transactional
start/stop a particular instance:
srvctl [start|stop] instance -d <database> -i <instance>,<instance>
Undo Management
To recap on undo management you can see my undo section. Instances in a RAC do not
share undo; they each have a dedicated undo tablespace, and the undo_tablespace
parameter points each instance at its own:
instance1.undo_tablespace=undo_tbs1
instance2.undo_tablespace=undo_tbs2
With today's Oracle you should be using automatic undo management (AUM); again I
have a detailed discussion on AUM in my undo section.
Temporary Tablespace
I have already discussed temporary tablespaces; in a RAC environment you should set up
a temporary tablespace group, which is then used by all instances of the RAC. Each
instance creates a temporary segment in the temporary tablespace it is using. If an
instance is running a large sort, temporary segments can be reclaimed from segments
belonging to other instances in that tablespace.
Redologs
I have already discussed redologs; in a RAC environment every instance has its own set
of redologs. Each instance has exclusive write access to its own redologs, but each
instance can read the others' redologs; this is used for recovery. Redologs are located on
the shared storage so that all instances can access each other's redologs. The process is a
little different from standard Oracle when changing the archive mode.
Flashback
Again, I have already talked about flashback; there is no difference in a RAC
environment apart from the setup.
SRVCTL command
We have already come across srvctl above; this command is called the server control
utility. It can be divided into two categories. I suggest that you look up the command,
but I will provide a few examples.
display the registered databases:
srvctl config database
status:
srvctl status database -d <database>
srvctl status instance -d <database> -i <instance>
srvctl status nodeapps -n <node>
srvctl status service -d <database>
srvctl status asm -n <node>
stopping/starting:
srvctl stop database -d <database>
srvctl stop instance -d <database> -i <instance>,<instance>
srvctl stop service -d <database> [-s <service>,<service>] [-i <instance>,<instance>]
srvctl stop nodeapps -n <node>
srvctl stop asm -n <node>
srvctl start database -d <database>
srvctl start instance -d <database> -i <instance>,<instance>
srvctl start service -d <database> -s <service>,<service> -i <instance>,<instance>
srvctl start nodeapps -n <node>
srvctl start asm -n <node>
adding/removing:
srvctl add database -d <database> -o <oracle_home>
srvctl add instance -d <database> -i <instance> -n <node>
srvctl add service -d <database> -s <service> -r <preferred_list>
srvctl add nodeapps -n <node> -o <oracle_home> -A <name|ip>/netmask
srvctl add asm -n <node> -i <asm_instance> -o <oracle_home>
srvctl remove database -d <database>
srvctl remove instance -d <database> -i <instance>
srvctl remove service -d <database> -s <service>
srvctl remove nodeapps -n <node>
srvctl remove asm -n <node>
Services
Services are used to manage the workload in Oracle RAC; the important features of
services are
• Goal - allows you to define a service goal using service time, throughput or none
• Connect Time Load Balancing Goal - listeners and mid-tier servers contain
current information about service performance
• Distributed Transaction Processing - used for distributed transactions
• AQ_HA_Notifications - information about nodes being up or down will be sent
to mid-tier servers via the advanced queuing mechanism
• Preferred and Available Instances - the preferred instances for a service; the
available ones are the backup instances
The view v$services contains information about services that have been started on that
instance. Services can be administered using
• DBCA
• EM (Enterprise Manager)
• DBMS_SERVICE
• Server Control (srvctl)
Two services are created when the database is first installed; these services are running
all the time and cannot be disabled.
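As a hedged sketch of the srvctl route, here is how a service might be created and started. The database name racdb, the service name oltp and the instance names are assumptions for illustration; -r names the preferred instances and -a the available (backup) instances:

```shell
# Sketch: add a service with a preferred and an available (backup) instance
srvctl add service -d racdb -s oltp -r racdb1 -a racdb2
# start the service and check where it is running
srvctl start service -d racdb -s oltp
srvctl status service -d racdb -s oltp
```

If the preferred instance fails, the service relocates to the available instance; srvctl relocate service can move it back after the failure is repaired.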
CRS is Oracle's clusterware software; you can use it with other third-party clusterware
software, though that is not required (apart from HP Tru64).
CRS starts automatically when the server starts; you should only stop this service in the
following situations
CRS Administration
starting:
## Starting CRS using Oracle 10g R1
not possible
## Starting CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl start crs
stopping:
## Stopping CRS using Oracle 10g R1
srvctl stop database -d <database>
srvctl stop asm -n <node>
srvctl stop nodeapps -n <node>
/etc/init.d/init.crs stop
disabling/enabling:
## Oracle 10g R1
/etc/init.d/init.crs [disable|enable]
## Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl [disable|enable] crs
checking:
$ORA_CRS_HOME/bin/crsctl check crs
$ORA_CRS_HOME/bin/crsctl check evmd
$ORA_CRS_HOME/bin/crsctl check cssd
$ORA_CRS_HOME/bin/crsctl check crsd
$ORA_CRS_HOME/bin/crsctl check install -wait 600
Resource Applications (CRS Utilities)
status:
$ORA_CRS_HOME/bin/crs_stat
$ORA_CRS_HOME/bin/crs_stat -t
$ORA_CRS_HOME/bin/crs_stat -ls
$ORA_CRS_HOME/bin/crs_stat -p
Note:
-t more readable display
-ls permission listing
-p parameters
create profile:
$ORA_CRS_HOME/bin/crs_profile
register/unregister application:
$ORA_CRS_HOME/bin/crs_register
$ORA_CRS_HOME/bin/crs_unregister
start/stop an application:
$ORA_CRS_HOME/bin/crs_start
$ORA_CRS_HOME/bin/crs_stop
resource permissions:
$ORA_CRS_HOME/bin/crs_getperm
$ORA_CRS_HOME/bin/crs_setperm
relocate a resource:
$ORA_CRS_HOME/bin/crs_relocate
Nodes
member number/name:
olsnodes -n
Note: the olsnodes command is located in $ORA_CRS_HOME/bin
local node name:
olsnodes -l
activate logging:
olsnodes -g
Oracle Interfaces
display:
oifcfg getif
delete:
oifcfg delif -global
set:
oifcfg setif -global <interface name>/<subnet>:public
oifcfg setif -global <interface name>/<subnet>:cluster_interconnect
Global Services Daemon Control
starting gsdctl start
stopping gsdctl stop
status gsdctl status
Cluster Configuration (clscfg is used during installation)
create a new configuration:
clscfg -install
Note: the clscfg command is located in $ORA_CRS_HOME/bin
upgrade or downgrade an existing configuration:
clscfg -upgrade
clscfg -downgrade
add or delete a node from the configuration:
clscfg -add
clscfg -delete
create a special single-node configuration for ASM:
clscfg -local
brief listing of terminology used in the other modes:
clscfg -concepts
used for tracing:
clscfg -trace
help:
clscfg -h
Cluster Name Check
print cluster name:
cemutlo -n
Note: in Oracle 9i the utility was called "cemutls"; the command is located in
$ORA_CRS_HOME/bin
print the clusterware version:
cemutlo -w
Note: in Oracle 9i the utility was called "cemutls"
Node Scripts
Add Node:
addnode.sh
Note: see adding and deleting nodes
Delete Node:
deletenode.sh
Note: see adding and deleting nodes
As you already know, the OCR is the registry that contains information on the following
• Node list
• Node membership mapping
• Database instance, node and other mapping information
• Characteristics of any third-party applications controlled by CRS
The file location is specified during the installation; the file pointer indicating the OCR
device location is ocr.loc, which can be found in either of the following
• linux - /etc/oracle
• solaris - /var/opt/oracle
The contents of ocr.loc look something like below; this was taken from my installation
ocrconfig_loc=/u02/oradata/racdb/OCRFile
ocrmirrorconfig_loc=/u02/oradata/racdb/OCRFile_mirror
local_only=FALSE
The OCR is important to the RAC environment and any problems must be actioned
immediately; the commands below can be found in $ORA_CRS_HOME/bin
OCR Utilities
log file:
$ORA_HOME/log/<hostname>/client/ocrconfig_<pid>.log
checking:
ocrcheck
Note: will return the OCR version, total space allocated, space used, free space,
location of each device and the result of the integrity check
dump contents:
ocrdump
Note: by default it dumps the contents into a file named OCRDUMPFILE in the
current directory
export/import:
ocrconfig -export <file>
ocrconfig -import <file>
backup/restore:
# show backups
ocrconfig -showbackup
# to change the location of the backup, you can even specify an ASM disk
ocrconfig -backuploc <path|+asm>
# perform a restore
ocrconfig -restore <file>
# delete a backup
ocrconfig -delete <file>
Note: there are many more options so see the ocrconfig man page
add/remove/replace:
## add/relocate the ocrmirror file to the specified location
ocrconfig -replace ocrmirror '/ocfs2/ocr2.dbf'
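Putting the ocrconfig commands together, an OCR restore from an automatic backup might look like the following sketch (10g R2 style; this is an assumption-laden outline, not a definitive procedure; CRS must be down on every node before restoring):

```shell
# Sketch: restoring the OCR from an automatic backup
/etc/init.d/init.crs stop                    # run on every node
$ORA_CRS_HOME/bin/ocrconfig -showbackup      # choose a backup to restore
$ORA_CRS_HOME/bin/ocrconfig -restore <file>  # restore it on one node
/etc/init.d/init.crs start                   # run on every node
$ORA_CRS_HOME/bin/ocrcheck                   # confirm the integrity check passes
```

Run ocrcheck on every node afterwards to make sure all of them see a consistent OCR.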
Voting Disk
The voting disk, as I mentioned in the architecture section, is used to resolve
membership issues in the event of a partitioned cluster; the voting disk protects data
integrity.
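Because the voting disk is so critical, it should be backed up whenever the cluster configuration changes. A commonly cited approach in 10g is a raw block copy with dd; the paths below are illustrative placeholders, so verify against your own configuration:

```shell
# Sketch: locate and back up the voting disk
$ORA_CRS_HOME/bin/crsctl query css votedisk       # list voting disk locations
dd if=<voting_disk_path> of=<backup_path> bs=4k   # block-for-block raw copy
```

The same dd command with if/of reversed can restore the copy, with the clusterware stopped.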
Backups and recovery are very similar to a single instance database. This article covers
only the specific issues that surround RAC backups and recovery; I have already written
an article on standard Oracle backups and recovery.
Oracle RAC can use all the above backup technologies, but Oracle prefers you to use
RMAN, Oracle's own backup solution.
Backup Basics
Oracle backups can be taken hot or cold; a backup will comprise the following
• Datafiles
• Control Files
• Archive redolog files
• Parameter files (init.ora or SPFILE)
Databases have now grown to very large sizes, well over a terabyte in some cases, thus
tape backups are not used in these cases; sophisticated disk mirroring has taken their
place. RMAN can be used in either a tape or disk solution; it can even work with
third-party solutions such as Veritas NetBackup.
In an Oracle RAC environment it is critical to make sure that all archive redolog files are
located on shared storage; this is required when trying to recover the database, as you
need access to all archive redologs. RMAN can use parallelism when recovering; the
node that performs the recovery must have access to all archived redologs, however,
during recovery only one node applies the archived logs, as in a standard single instance
configuration.
Oracle RAC also supports Oracle Data Guard, thus you can have a primary database
configured as a RAC and a standby database also configured as a RAC.
Instance Recovery
Redo information generated by an instance is called a thread of redo. All log files for that
instance belong to this thread, an online redolog file belongs to a group and the group
belongs to a thread. Details about log group file and thread association details are stored
in the control file. RAC databases have multiple threads of redo, each instance has one
active thread, the threads are parallel timelines and together form a stream. A stream
consists of all the threads of redo information ever recorded, the streams form the
timeline of changes performed to the database.
Oracle records the changes made to a database, these are called change vectors. Each
vector is a description of a single change, usually a single block. A redo record contains
one or more change vectors and is located by its Redo Byte Address (RBA) and points to
a specific location in the redolog file (or thread). It will consist of three components
Checkpoints are the same in a RAC environment and a single instance environment, I
have already discussed checkpoints, when a checkpoint needs to be triggered, Oracle will
look for the thread checkpoint that has the lowest checkpoint SCN, all blocks in memory
that contain changes made prior to this SCN across all instances must be written out to
disk. I have discussed how to control recovery in my Oracle section and this applies to
RAC as well.
Crash Recovery
Crash recovery is basically the same for a single instance and a RAC environment; I have
a complete recovery section in my Oracle section, here are notes detailing the differences
1. The on-disk block is the starting point for the recovery; Oracle will only consider
the block on disk, so the recovery is simple. Crash recovery will automatically
happen using the online redo logs that are current or active
2. The starting point is the last full checkpoint. The starting point is provided by the
control file and compared against the same information in the data file headers;
only the changes need to be applied
3. The block specified in the redolog is read into cache; if the block has the same
timestamp as the redo record (SCN match) the redo is applied.
For a RAC instance the following is the recovery process
Oracle RAC uses a two-pass recovery, because a data block could have been modified in
any of the instances (dead or alive), so it needs to obtain the latest version of the dirty
block; it uses the Past Image (PI) and the Block Written Record (BWR) to achieve this in
a quick and timely fashion.
Block Written Record (BWR)
The cache aging and incremental checkpoint system writes a number of blocks to disk;
when the DBWR completes a data block write operation, it also adds a redo record that
states the block has been written (data block address and SCN). DBWn can write block
written records (BWRs) in batches, though in a lazy fashion. In RAC a BWR is written
when an instance writes a block covered by a global resource or when it is told that the
past image (PI) buffer it is holding is no longer necessary.
Past Image (PI)
This is what makes RAC Cache Fusion work; it eliminates the write/write contention
problem that existed in the OPS database. A PI is a copy of a globally dirty block and is
maintained in the database buffer cache; it can be created and saved when a dirty block
is shipped across to another instance after setting the resource role to global. The GCS is
responsible for informing an instance that its PI is no longer needed after another
instance writes a newer (current) version of the same block. PIs are discarded when GCS
posts all the holding instances that a new and consistent version of that particular block
is now on disk.
The first pass does not perform the actual recovery but merges and reads redo threads to
create a hash table of the blocks that need recovery and that are not known to have been
written back to the datafiles. The checkpoint SCN is needed as a starting point for the
recovery; all modified blocks are added to the recovery set (an organized hash table). A
block will not be recovered if its BWR version is greater than the latest PI in any of the
buffer caches.
In the second pass SMON rereads the merged redo stream (by SCN) from all threads
needing recovery; the redolog entries are then compared against the recovery set built in
the first pass and any matches are applied to the in-memory buffers as in a single pass
recovery. The buffer cache is flushed and the checkpoint SCN for each thread is updated
upon successful completion.
I have a detailed section on cache fusion; this section covers its role in recovery. Cache
fusion is only used in RAC environments, and additional steps are required, such as GRD
reconfiguration, internode communication, etc. There are two types of recovery.
In both cases the threads from failed instances need to be merged; in an instance
recovery SMON performs the recovery, whereas in a crash recovery a foreground process
performs the recovery.
• Recovery cost is proportional to the number of failures, not the total number of
nodes
• It eliminates disk reads of blocks that are present in a surviving instance's cache
• It prunes recovery set based on the global resource lock state
• The cluster is available after an initial log scan, even before recovery reads are
complete
In cache fusion the starting point for recovery of a block is its most current PI version,
this could be located on any of the surviving instances and multiple PI blocks of a
particular buffer can exist.
Remastering is the term used to describe the operation whereby a node attempting
recovery tries to own or master the resource(s) that were once mastered by another
instance prior to the failure. When one instance leaves the cluster, the GRD of that
instance needs to be redistributed to the surviving nodes. RAC uses an algorithm called
lazy remastering to remaster only a minimal number of resources during a
reconfiguration. The entire Parallel Cache Management (PCM) lock space remains
invalid while the DLM and SMON complete the below steps
1. The IDLM master node discards locks that are held by dead instances; the space
reclaimed by this operation is used to remaster locks that are held by the surviving
instances for which a dead instance was the master
2. SMON issues a message saying that it has acquired the necessary buffer locks to
perform recovery
Let's look at an example of what happens during remastering; let's presume instance B
is removed from the cluster. Only the resources from instance B are evenly remastered
across the surviving nodes (no resources on instances A and C are affected); this reduces
the amount of work the RAC has to perform. Likewise, when an instance joins a cluster
only a minimum amount of resources are remastered to the new instance.
Before Remastering
After Remastering
You can also force a dynamic remastering (DRM) of an object using oradebug.
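A hedged sketch of forcing DRM with oradebug is below; the lkdebug interface is undocumented and its syntax varies between versions, and <object_id> stands for the object's DATA_OBJECT_ID from dba_objects, so treat this as illustrative only and use it on test systems:

```sql
-- Sketch only: manually remaster an object (undocumented oradebug interface)
oradebug setmypid
oradebug lkdebug -m pkey <object_id>
```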
RAC Performance
I have already discussed basic Oracle tuning; in this section I will mainly discuss Oracle
RAC tuning. First let's review the best practices of an Oracle design regarding the
application and database
• Optimize connection management; ensure that the middle tier and programs that
connect to the database are efficient in connection management and do not log on
or off repeatedly
• Tune the SQL using the available tools such as ADDM and SQL Tuning Advisor
• Ensure that applications use bind variables; cursor_sharing was introduced to
solve this problem
• Use packages and procedures (because they are compiled) in place of anonymous
PL/SQL blocks and big SQL statements
• Use locally managed tablespaces and automatic segment space management to
help performance and simplify database administration
• Use automatic undo management and temporary tablespaces to simplify
administration and increase performance
• Ensure you use large caching when using sequences, unless you cannot afford to
lose sequence numbers during a crash
• Avoid using DDL in production; it increases invalidations of the already parsed
SQL statements and they need to be recompiled
• Partition tables and indexes to reduce index leaf contention (buffer busy global cr
problems)
• Optimize contention on data blocks (hot spots) by avoiding small tables with too
many rows in a block
• If the CPU and private interconnects are of high performance then there is no
need to partition
• Partitioning does add complexity; if you can increase CPU and interconnect
performance instead, so much the better
• Only partition if performance is being impacted
• Test both partitioning and non-partitioning to see what difference it makes, then
decide if partitioning is worth it
An event is an operation or particular function that the Oracle kernel performs on behalf
of a user or an Oracle background process; events have specific names like database event.
Whenever a session has to wait for something, the wait time is tracked and charged to the
event that was associated with that wait. Events that are associated with all such waits are
known as wait events. There are a number of wait classes
• Commit
• Scheduler
• Application
• Configuration
• User I/O
• System I/O
• Concurrency
• Network
• Administrative
• Cluster
• Idle
• Other
There are over 800 different events spread across the above list; however, you will
probably only deal with about 50 or so that can improve performance.
When a session requests access to a data block it sends a request to the lock master for
proper authorization; the request does not know whether it will receive the block via
Cache Fusion or permission to read it from disk. Two placeholder events
• global cache cr request (consistent read - cr)
• global cache curr request (current - curr)
keep track of the time a session spends in this state. There are a number of types of wait
events regarding access to a data block; for example, in a current block request an
instance requests authorization for a block to be accessed in current mode in order to
modify it, and the instance mastering the resource receives the request. The master has
the current version of the block and sends the current copy of the block to the requestor
via Cache Fusion, keeping a Past Image (PI).
Enqueue Tuning
Oracle RAC uses a queuing mechanism to ensure proper use of shared resources; this is
called Global Enqueue Services (GES). Enqueue wait is the time spent by a session
waiting for a shared resource. Some enqueues are managed by the instance itself, others
are used globally; GES is responsible for coordinating the global resources.
I have already discussed AWR in a single instance environment, so for a quick refresher
take a look and come back here to see how you can use it in a RAC environment.
From a RAC point of view there are a number of RAC-specific sections that you need to
look at in the AWR; the sections below describe an AWR report of my home RAC
environment, you can view the whole report here.
RAC AWR Report Section - Description
• Number of instances - lists the number of instances from the beginning and end of
the AWR report
• Instance global cache load profile - global information about the interinstance
Cache Fusion data block and messaging traffic; because my AWR report is
lightweight here is a more heavily used RAC example
• Global cache efficiency percentage - workload characteristics broken down into
Local cache, Remote cache and Disk. The first two give the cache hit ratio for the
instance; you are looking for a value less than 10%, and if you are getting higher
values then you may consider application partitioning.
• GCS and GES - workload characteristics - this section contains timing statistics for
global enqueue and global cache. As a general rule, all timings related to CR
(Consistent Read) block processing should be less than 10 msec, and all timings
related to CURRENT block processing should be less than 20 msec.
• Messaging statistics - the first section relates to sending a message and should be
less than 1 second. The second section details the breakup of direct and indirect
messages; direct messages are sent by an instance foreground or user process to
remote instances, indirect messages are not urgent and are pooled and sent.
• Service statistics - shows the resources used by all the services each instance
supports
• Service wait class statistics - summarizes waits in different categories for each
service
• Top 5 CR and current blocks - contains the names of the top 5 contentious
segments (table or index). If a table or index has a very high percentage of CR and
current block transfers you need to investigate; this is pretty much like a normal
single instance.
Cluster Interconnect
As I stated above, the interconnect is a critical part of the RAC; you must make sure that
it is on the best hardware you can buy. You can confirm that the interconnect is being
used in Oracle 9i and 10g by using the oradebug command to dump information out to a
trace file; in Oracle 10g R2 the cluster interconnect is also recorded in the alert.log file,
you can view my information from here.
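The oradebug dump mentioned above can be produced as follows; this is a sketch, and the resulting trace file lands in the directory given by user_dump_dest:

```sql
-- Sketch: dump interconnect (IPC) information to a trace file
SQL> oradebug setmypid
SQL> oradebug ipc
-- the trace file then shows which IP address/interface the interconnect uses
```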
The RAC environment includes many resources, such as multiple versions of data block
buffers in buffer caches in different modes; Oracle uses locking and queuing mechanisms
to coordinate lock resources, data and interinstance data requests. Resources such as data
blocks and locks must be synchronized between nodes as nodes within a cluster acquire
and release ownership of them. The synchronization provided by the Global Resource
Directory (GRD) maintains cluster-wide concurrency of the resources and in turn
ensures the integrity of the shared data. Synchronization is also required for buffer cache
management as it is divided into multiple caches, and each instance is responsible for
managing its own local version of the buffer cache. Copies of data are exchanged
between nodes; this is sometimes referred to as the global cache, but in reality each
node's buffer cache is separate and copies of blocks are exchanged through a traditional
distributed locking mechanism.
Global Cache Services (GCS) maintains cache coherency across buffer cache resources,
and Global Enqueue Services (GES) controls resource management across the cluster's
non-buffer cache resources.
Cache Coherency
Cache coherency identifies the most up-to-date copy of a resource, also called the master
copy; it is a mechanism by which multiple copies of an object are kept consistent
between Oracle instances. Parallel Cache Management (PCM) ensures that the master
copy of a data block is stored in one buffer cache and consistent copies of the data block
are stored in other buffer caches; the process LCKx is responsible for this task.
The lock and resource structures for instance locks reside in the GRD (also called the
DLM); it is a dedicated area within the shared pool. Details about the data block
resources and cached versions are maintained by GCS. Additional details such as the
location of the most current version, the state of the buffer, the role of the data block
(local or global) and ownership are maintained by GES. The global cache together with
GES forms the GRD. Each instance maintains a part of the GRD in its SGA. The GCS
and GES nominate one instance (the resource master) to manage all information about a
particular resource; each instance knows which instance is the master for which resource.
Locks are placed on a resource grant or convert queue; if the lock changes it moves
between the queues. A lock leaves the convert queue under the following conditions
• The process requests the lock termination (it removes the lock)
• The process cancels the conversion; the lock is moved back to the grant queue in
the previous mode
• The requested mode is compatible with the most restrictive lock in the grant
queue and with all the previous modes of the convert queue, and the lock is at the
head of the convert queue
Convert requests are processed on a FIFO basis, the grant queue and convert queue are
associated with each and every resource that is managed by the GES.
Enqueues are basically locks that support queuing mechanisms and that can be acquired
in different modes. An enqueue can be held in exclusive mode by one process and others
can hold a non-exclusive mode depending on the type. Enqueues are the same in RAC as
they are in a single instance.
Global Enqueue Services (GES)
GES coordinates the requests of all global enqueues, it also deals with deadlocks and
timeouts. There are two types of local locks, latches and enqueues, latches do not affect
the cluster only the local instance, enqueues can affect both the cluster and the instance.
Enqueues are shared structures that serialize access to database resources, they support
multiple modes and are held longer than latches, they protect persistent objects such as
tables or library cache objects. Enqueues can use any of the following modes
Global Locks
Each node has information for a set of resources, Oracle uses a hashing algorithm to
determine which nodes hold the directory tree information for the resource. Global locks
are mainly of two types
• Locks used by the GCS for buffer cache management, these are called PCM locks
• Global locks (global enqueue) that Oracle synchronizes within a cluster to
coordinate non-PCM resources, they protect the enqueue structures
An instance owns a global lock that protects a resource (i.e. data block or data dictionary
entry) when the resource enters the instance's SGA.
GES locks control access to data files (not the data blocks) and control files and also
serialize interinstance communication. They also control library caches and the dictionary
cache. Examples of this are DDL, DML enqueue table locks, transaction enqueues and
DDL locks or dictionary locks. The SCN and mount lock are global locks.
Transaction and row locks are the same as in a single instance database, the only
difference is that the enqueues are global enqueues, take a look in locking for an in depth
view on how Oracle locking works.
Messaging
The difference between RAC and single instance messaging is that RAC uses the high
speed interconnect whereas a single instance uses shared memory and semaphores;
interrupts are used when one or more processes want to use the processor in a multiple
CPU architecture. GES uses messaging for interinstance communication; this is done by
messages and asynchronous traps (ASTs). Both LMON and LMD use messages to
communicate with other instances; the GRD is updated when locks are required. The
messaging traffic can be viewed using the view V$GES_MISC.
Because GES relies heavily on messaging, the interconnect must be of high quality (high
performance, low latency); the messages are also kept small (128 bytes) to increase
performance. The Traffic Controller (TRFC) is used to control the DLM traffic between
the instances in the cluster; it uses buffering to accommodate large volumes of traffic.
The TRFC keeps track of everything by using tickets (sequence numbers); there is a
predefined pool of tickets which is dependent on the network send buffer size. A ticket is
obtained before sending any messages; once sent, the ticket is returned to the pool (LMS
or LMD perform this). If there are no tickets then the message has to wait until a ticket is
available. You can control the number of tickets and view them
system parameters:
_lm_tickets
_lm_ticket_active_sendback (used for aggressive messaging)
ticket usage:
select local_nid local, remote_nid remote, tckt_avail avail, tckt_limit limit,
snd_q_len send_queue, tckt_wait waiting from v$ges_traffic_controller;
dump ticket information:
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug lkdebug -t
Note: the output can be viewed here
GCS locks protect only the data blocks in the global cache (they are also known as PCM
locks); a lock can be acquired in shared or exclusive mode. Each lock element can have
the lock role set to either local (the same as single instance) or global. In the global role
three lock modes are possible: shared, exclusive and null. In global role mode you can
read or write to the data block only as directed by the master instance of that resource.
The lock and state information is held in the SGA and is maintained by GCS; these
structures are called lock elements. Each lock element also holds a chain of cache buffers
that are covered by it. Lock elements can be viewed via v$lock_element, and the
parameter _db_block_hash_buckets controls the number of hash buffer chain buckets.
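As a quick sanity check, the lock elements can be inspected from SQL*Plus; this is a hedged example (the v$lock_element columns shown here may vary between Oracle releases):

```sql
-- Count lock elements grouped by the lock mode currently held;
-- run as a suitably privileged user
select mode_held, count(*)
from   v$lock_element
group  by mode_held;
```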
Exclusive (X)   used during update or any DML operation; if another instance requires a
                block that has an exclusive lock on it, it asks GES to request that the
                second instance disown the global lock
Shared (S)      used for select operations; reading data does not require an instance to
                disown a global lock
Null (N)        allows instances to keep a lock without any permission on the block(s);
                this mode is used so that locks need not be created and destroyed all the
                time, it just converts from one lock to another
Lock roles are used by Cache Fusion; a role can be either local or global. The resource is
local if the block is dirty only in the local cache; it is global if the block is dirty in a
remote cache or in several remote caches. A Past Image (PI) is kept by an instance when
a block is shipped to another instance, and the role is then changed to global; thus the PI
represents the state of a dirty buffer. A node must keep a PI until it receives notification
from the master that a write to disk covering that version has completed; the node will
then log a block written record (BWR). I have already discussed PIs and BWRs in my
backup section.
When a new current block arrives, the previous PI remains untouched in case another
node requires it. If a number of PIs exist, they may or may not be merged into a single
PI; the master determines this based on whether the older PIs are required. An
indeterminate number of PIs can exist.
In the local role only S and X modes are permitted; when requested by the master
instance, the holding instance serves a copy of the block to others. If the block is globally
clean, the instance's lock role remains local. If the block is modified (dirty), a PI is
retained and the lock role becomes global. In the global lock role the lock modes can be
N, S and X; the block is global, it may even be dirty in any of the instances, and the disk
version may be obsolete. Interested parties can only modify the block using X mode, and
an instance cannot read it from disk as that copy may not be current; the holding instance
can send copies to other instances when instructed by the master.
I have a complete detailed walkthrough in my cache fusion section, which will help you
understand this better.
A lock element holds lock state information (converting, granting, etc.). LEs are managed
by the lock processes to determine the mode of the locks; they also hold a chain of cache
buffers that are covered by the LE, allowing the Oracle database to keep track of cache
buffers that must be written to disk in case an LE (mode) needs to be downgraded (X >
N).
LEs protect all the data blocks in the buffer cache, the list below describes the classes of
the data block which are managed by the LEs using GCS locks (x$bh.class).
0 FREE
1 EXLCUR
2 SHRCUR
3 CR
4 READING
5 MRECOVERY
6 IRECOVERY
7 WRITING
8 PI
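A hedged way to see how the buffers in the local cache are distributed across these classes (x$bh is only visible to the SYS user, and some references name this column state rather than class):

```sql
-- Buffer distribution per class; class 8 (PI) shows how many
-- past images this instance is currently holding
select class, count(*)
from   x$bh
group  by class
order  by class;
```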
So putting this all together you get the following: GCS manages PCM locks in the GRD,
and PCM locks manage the data blocks in the global cache. Data blocks can be kept in
any instance's buffer cache (which is global); if not found there, a block can be read from
disk by the requesting instance. The GCS monitors and maintains the list and mode of the
blocks in all the instances. Each instance will master a number of resources, but a
resource can only be mastered by one instance. GCS ensures cache coherency by
requiring that instances acquire a lock before modifying or reading a database block.
GCS locks are not row-level locks; row-level locks are used in conjunction with PCM
locks. A GCS lock ensures that the block is accessed by one instance, then row-level
locks manage the block at the row level. If a block is modified, all Past Images (PIs) are
no longer current and new copies need to be obtained.
Consistent read processing means that readers never block writers, the same as in a
single instance. One parameter that can help is _db_block_max_cr_dba, which limits the
number of CR copies per DBA in the buffer cache. If too many CR requests arrive for a
particular buffer, the holder can disown the lock on the buffer and write the buffer to
disk, so that the requestor can then read it from disk; this is useful especially if the
requested block has an older SCN and the requestor needs to reconstruct it (known as
CR fabrication). This is technically known as a fairness downconvert, and the parameter
_fairness_threshold can be used to configure it.
The lightwork rule is invoked when CR construction involves too much work and no
current block or PI block is available in the cache for block cleanouts. The below can be
used to view the number of times a downconvert occurs.
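These counters are exposed through v$cr_block_server; a hedged example (column names as in 10g, they may differ in other releases):

```sql
-- LIGHT_WORKS counts invocations of the lightwork rule,
-- FAIRNESS_DOWN_CONVERTS counts fairness downconverts
select cr_requests, light_works, fairness_down_converts
from   v$cr_block_server;
```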
The GRD is a central repository for locks and resources, it is distributed across all nodes
(not a single node), but only one instance masters a resource. The process of maintaining
information about resources is called lock mastering or resource mastering. I spoke about
lock remastering in my backup section.
Resource affinity allows the mastering of frequently used resources on the local node; it
uses dynamic resource mastering to move the location of the resource masters. Normally
resource remastering only happens when an instance joins or leaves the RAC
environment; as of Oracle 10g R2 mastering occurs at the object level, which enables
fine-grained object remastering. There are a number of parameters that can be used to
dynamically remaster an object.
You should consult Oracle before changing any of the above parameters.
8. Cache Fusion
Introduction
I mentioned Cache Fusion above in my GRD section; here I go into greater detail on how
it works, and I will also provide a number of walkthrough examples on my RAC system.
Cache Fusion uses the most efficient communications possible to limit the amount of
traffic used on the interconnect. You don't need this level of detail to administer a RAC
environment, but it certainly helps to understand how RAC works when trying to
diagnose problems. RAC appears to have one large buffer cache, but this is not the case;
in reality the buffer caches of each node remain separate, and data blocks are shared
through distributed locking and messaging operations. RAC copies data blocks across the
interconnect to other instances because it is more efficient than reading from disk; yes,
memory and networking together are faster than disk I/O.
Ping
The transfer of a data block from one instance's buffer cache to another instance's buffer
cache is known as a ping. As mentioned already, when an instance requires a data block
it sends the request to the lock master to obtain a lock in the desired mode; this process is
known as a blocking asynchronous trap (BAST). When an instance receives a BAST it
downgrades the lock ASAP; however, it might first have to write the corresponding block
to disk, an operation known as a disk ping or hard ping. Disk pings have been reduced in
the later versions of RAC, which rely on block transfers instead, but there will always be
a small amount of disk pinging. In the newer versions of RAC, when a BAST is received,
sending the block or downgrading the lock may be deferred by tens of milliseconds; this
extra time allows the holding instance to complete an active transaction and mark the
block header appropriately, which eliminates any need for the receiving instance to check
the status of the transaction immediately after receiving/reading the block. Checking the
status of a transaction is an expensive operation that may require access (and pinging) to
the related undo segment header and undo data blocks as well. The parameter
_gc_defer_time can be used to define the duration by which an instance defers
downgrading a lock.
Past Image Blocks (PI)
In the GRD section I mentioned Past Images (PIs); basically they are copies of data
blocks in the local buffer cache of an instance. When an instance sends a block it has
recently modified to another instance, it preserves a copy of that block, marking it as a
PI. The PI is kept until that block is written to disk by the current owner of the block.
When the block is written to disk and is known to have a global role, indicating the
presence of PIs in other instances' buffer caches, GCS informs the instances holding the
PIs to discard them. When a checkpoint is required, the instance informs GCS of the
write requirement; GCS is responsible for finding the most current block image and
informing the instance holding that image to perform the block write. GCS then informs
all holders of the global resource that they can release the buffers holding the PI copies of
the block, allowing the global resource to be released. You can view the past image
blocks present in the fixed table X$BH.
Cache Fusion I
Cache Fusion I is also known as the consistent read server and was introduced in Oracle
8.1.5; it keeps a list of recent transactions that have changed a block. The original data
contained in the block is preserved in the undo segment, which can be used to provide
consistent read versions of the block.
• When a reader reads a recently modified block, it might find an active transaction
in the block
• The reader will need to read the undo segment header to decide whether the
transaction has been committed or not
• If the transaction is not committed, the process creates a consistent read (CR)
version of the block in the buffer cache using the data in the block and the data
stored in the undo segment
• If the undo segment shows the transaction is committed, the process has to revisit
the block and clean out the block (delay block cleanout) and generate the redo for
the changes.
In a RAC environment, if the process reading the block is on an instance other than the
one that modified the block, the reader will have to read the following blocks from
disk:
• data block to get the data and/or transaction ID and Undo Byte Address (UBA)
• undo segment header block to find the last undo block used for the entire
transaction
• undo data block to get the actual record to construct a CR image
Before these blocks can be read, the instance that modified the block will have to write
them to disk, resulting in 6 I/O operations. In RAC the modifying instance can instead
construct a CR copy, ideally using the above blocks that are still in memory, and then
send the CR copy over the interconnect, thus avoiding the 6 I/O operations.
Oracle 8 introduced a new background process called the Block Server Process, which
performs the CR fabrication in the holder's cache and ships the CR version of the block
across the interconnect; the sequence is detailed in the table below
1. An instance sends a message to the lock manager requesting a shared lock on the
block
2. Following are the possibilities in the global cache
o If there is no current user for the block, the lock manager grants the shared
lock to the requesting instance
o if the other instance has an exclusive lock on the block, the lock manager
asks the owning instance to build a CR copy and ship it to the requesting
instance.
3. Based on the result, either of the following can happen
o if the lock is granted, the requesting instance reads the block from disk
o The owning instance creates a CR version of the buffer in its own buffer
cache and ships it to the requesting instance over the interconnect
4. The owning instance also informs the lock manager and requesting instance that it
has shipped the block
5. The requesting instance has the lock granted, and the lock manager updates the
   IDLM with the new holders of that resource
There are exceptions where the owning instance behaves differently:
• If it does not find any of the blocks needed in its buffer cache, it will not perform a
  disk read to make a CR copy for another instance
• If it is repeatedly asked to send a CR copy of the same block, then after sending the
  CR copy four times it will voluntarily relinquish the lock, write the block to disk
  and let the other instances get the block from disk. The number of copies it will
  serve before doing so is governed by the parameter _fairness_threshold
Cache Fusion II
Read/write contention was addressed by Cache Fusion I; Cache Fusion II addresses
write/write contention
1. An instance sends a message to the lock manager requesting an exclusive lock on
the block
2. Following are the possibilities in the global cache
o If there is no current user for the block, the lock manager grants the
exclusive lock to the requesting instance
o if the other instance has an exclusive lock on the block, the lock manager
asks the owning instance to release the lock
3. Based on the result, either of the following can happen
o if the lock is granted, the requesting instance reads the block from disk
o The owning instance sends the current block to the requesting instance via
the interconnect, to guarantee recovery in the event of instance death, the
owning instance writes all the redo records generated for the block to the
online redolog file. It will keep a past image of the block and inform the
master instance that it has sent the current block to the requesting instance
4. The lock manager updates the resource directory (GRD) with the current holder of
the block
Cache Fusion in Operation
A quick recap of GCS: a GCS resource can be local or global. If it is local it can be acted
upon without consulting other instances; if it is global it cannot be acted upon without
consulting or informing the remote instances. GCS is used as a messaging agent to
coordinate manipulation of a global resource. By default all resources are in NULL mode
(remember, null mode is used to convert from one type to another (shared or exclusive));
for example, a code of SL0 means a global shared lock with no past images (PIs).
Instance C wants to read the block, so it requests a lock in share mode from the master
instance.
4. Instance C has the block in share mode, and the lock manager updates the resource
   directory.
Reading a block from the cache
Carrying on from the above example, instance B wants to read the same block that is
cached in instance C's buffer.
4. Instance B sends a message to instance D that it has assumed the SL lock for the
   block. This message is not critical for the lock manager, so it is sent
   asynchronously
5. Instance A modifies the block in its buffer cache; the changes are not committed
   and the block has not been written to disk, so the SCN remains at 987654
Getting a (Cached) modified block for update and commit
Carrying on from the above example, instance C now wants to modify the block. If it
tried to modify the same row it would have to wait until instance A either commits or
rolls back; however, in this case instance C wants to modify a different row in the same
block.
Carrying on from the above example, instance A now issues a commit to release the row
level locks held by the transaction and flush the redo information to the redologs
1. Instance A wants to commit the changes, commit operations do not require any
synchronous modifications to the block
2. The lock status remains the same as the previous state and change vectors for the
commits are written to the redologs.
Write the dirty buffers to disk due to a checkpoint
Carrying on from the above example, instance B writes the dirty blocks from its buffer
cache due to a checkpoint (this is where it gets interesting and very clever)
7. All instances that have previously modified this block will also have to write a
   BWR. The write request by instance C has now been satisfied and instance C can
   now proceed with its checkpoint as usual
Master instance crashes
2. The Global Resource Directory is frozen momentarily and the resources held by
   master instance D will be equally distributed among the surviving nodes, also
   known as remastering (see remastering for more details).
Select the rows from Instance A
Carrying on from the above example, now instance A queries the rows from that table to
get the most recent data
9. RAC Troubleshooting
Troubleshooting
This is the one section that will be updated frequently as my experience with RAC
grows. As RAC has been around for a while, most problems can be resolved with a
simple Google lookup, but a basic understanding of where to look for the problem is
required. In this section I will point you to where to look for problems. Every instance in
the cluster has its own alert log, which is where you would start looking; alert logs
contain startup and shutdown information, nodes joining and leaving the cluster, etc.
Here is my complete alert log file of my two node RAC starting up.
The cluster itself has a number of log files that can be examined to gain insight into
occurring problems; the table below describes the information that you may need on the
CRS components.
Now let's look at a two node startup and the sequence of events.
First you must check that the RAC environment is using the correct interconnect; this
can be done by any of the following
logfile            /u01/app/oracle/admin/racdb/bdump/alert_racdb1.log
                   ## The location of my alert log, yours may be different
oifcfg command     oifcfg getif
table check        select inst_id, pub_ksxpia, picked_ksxpia, ip_ksxpia from x$ksxpia;
oradebug           SQL> oradebug setmypid
                   SQL> oradebug ipc
                   Note: check the trace file, which can be located via the parameter
                   user_dump_dest
system parameter   cluster_interconnects
                   Note: used to specify which address to use
When the instance starts up, the Lock Monitor's (LMON) job is to register with the Node
Monitor (NM) (see the table below). Remember, when a node joins or leaves the cluster
the GRD undergoes a reconfiguration event; as seen in the logfile it is a seven step
process (see below for more details on the seven step process).
The LMON trace file also has details about reconfigurations, and it details the reason for
the event
reconfiguration reason   description
1                        means that the NM initiated the reconfiguration event, typical
                         when a node joins or leaves a cluster
2                        means that an instance has died. How does RAC detect an instance
                         death? Every instance updates the control file with a heartbeat
                         through its checkpoint (CKPT); if the heartbeat information is
                         missing for x amount of time, the instance is considered to be dead
                         and the Instance Membership Recovery (IMR) process initiates
                         reconfiguration
3                        means a communication failure of a node/s. Messages are sent
                         across the interconnect; if a message is not received within a set
                         amount of time then a communication failure is assumed. By
                         default UDP is used, which can be unreliable, so keep an eye on
                         the logs if too many reconfigurations happen for reason 3
Example of a reconfiguration, taken from the alert log:

Sat Mar 20 11:35:53 2010
Reconfiguration started (old inc 2, new inc 4)
List of nodes:
 0 1
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Mar 20 11:35:53 2010
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Sat Mar 20 11:35:53 2010
LMS 0: 0 GCS shadows traversed, 3291 replayed
Sat Mar 20 11:35:53 2010
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Confirm that the database has been started in cluster mode, the log file will state the
following
Starting with 10g the SCN is broadcast across all nodes; the system has to wait until all
nodes have seen the commit SCN. You can change the broadcast method using the
system parameter _lgwr_async_broadcasts.
Lamport Algorithm
The Lamport algorithm generates SCNs in parallel, and they are assigned to transactions
on a first come, first served basis; this is different from a single instance environment. A
broadcast method is used after a commit operation; this method is more CPU intensive
as it has to broadcast the SCN for every commit, but the other nodes can see the
committed SCN immediately.
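To observe how the commit SCN propagates, you can compare the current SCN across instances; a hedged example (gv$database and the current_scn column are available from 10g onwards):

```sql
-- The SCNs should be closely in step on all instances,
-- since commit SCNs are broadcast cluster-wide
select inst_id, current_scn from gv$database;
```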
There are times when you may wish to disable RAC, this feature can only be used in a
Unix environment (no windows option).
Performance Issues
Oracle can suffer a number of different performance problems and can be categorized by
the following
• Hung Database
• Hung Session(s)
• Overall instance/database performance
• Query Performance
capture information   # using oradebug
                      SQL> select * from dual;
                      SQL> oradebug setmypid
                      SQL> oradebug unlimit
                      SQL> oradebug dump systemstate 10
A severe performance problem can be mistaken for a hang; this usually happens because
of contention problems. A systemstate dump is normally used to analyze this problem;
however, a systemstate dump takes a long time to complete and it also has a number of
limitations.
To overcome these limitations, a new utility command called hanganalyze was released
with 8i; it provides clusterwide information in a RAC environment in a single shot.
sql method alter session set events 'immediate trace hanganalyze level <level>';
oradebug SQL> oradebug hanganalyze <level>
The hanganalyze command uses internal kernel calls to determine whether a session is
waiting for a resource and reports the relationship between blockers and waiters. A
systemstate dump is better, but if you are overwhelmed try hanganalyze first.
A node is evicted from the cluster after it kills itself because it is not able to service the
application; this generally happens when you have communication problems. For node
eviction problems look for ORA-29740 errors in the alert log file and the LMON trace
files.
To understand eviction problems you need to know the basics of how node membership
and instance membership recovery (IMR) work. When a communication failure happens,
the heartbeat information in the controlfile cannot be updated, and data corruption could
occur. IMR will remove from the cluster any nodes that it deems to be a problem; IMR
will ensure that the larger part of the cluster survives and kills any remaining nodes.
IMR is part of the service offered by Cluster Group Services (CGS). LMON handles
many of the CGS functionalities; this works at the cluster level and can work with 3rd
party software (Sun Cluster, Veritas Cluster). The Node Monitor (NM) provides
information about nodes and their health by registering and communicating with the
Cluster Manager (CM). Node membership is represented as a bitmap in the GRD.
LMON will let the other nodes know of any changes in membership; for example, if a
node joins or leaves the cluster, the bitmap is rebuilt and communicated to all nodes.
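From the database side you can quickly see which instances are currently members of the cluster, using the standard gv$instance view:

```sql
-- One row per running instance; a missing row indicates that
-- the instance has left (or been evicted from) the cluster
select inst_id, instance_name, host_name, status
from   gv$instance;
```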
Node registering (alert log)   lmon registered with NM - instance id 1 (internal mem no 0)
One thing to remember is that all nodes must be able to read from and write to the
controlfile. CGS makes sure that members are valid; it uses a voting mechanism to check
the validity of each member. I have already discussed the voting disk in my architecture
section. As stated above, membership is held in a bitmap in the GRD; the CKPT process
updates the controlfile every 3 seconds in an operation known as a heartbeat. It writes
into a single block that is unique for each instance, so intra-instance coordination is not
required; this block is called the checkpoint progress record. You can see the controlfile
records using the gv$controlfile_record_section view. All members attempt to obtain a
lock on the controlfile record for updating; the instance that obtains the lock tallies the
votes from all members. The group membership must conform to the decided (voted)
membership before the GCS/GES reconfiguration is allowed to proceed; the controlfile
vote result is stored in the same block as the heartbeat, in the controlfile checkpoint
progress record.
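A hedged example of looking at the checkpoint progress record section of the controlfile:

```sql
-- Shows how many checkpoint progress records the controlfile
-- holds; there is one per instance
select type, record_size, records_total, records_used
from   v$controlfile_record_section
where  type = 'CHECKPOINT PROGRESS';
```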
The seven step reconfiguration process is as follows:
1. Name service is frozen; the CGS contains an internal database of all the
   members/instances in the cluster with all their configuration and servicing
   details.
2. Lock database (IDLM) is frozen, this prevents processes from obtaining locks on
resources that were mastered by the departing/dead instance
3. Determination of membership and validation and IMR
4. Bitmap rebuild takes place, instance name and uniqueness verification, GCS must
synchronize the cluster to be sure that all members get the reconfiguration event
and that they all see the same bitmap.
5. Delete all dead instance entries and republish all names newly configured
6. Unfreeze and release name service for use
7. Hand over reconfiguration to GES/GCS
Oracle server management configuration tools include a diagnostic and tracing facility
for verbose output for SRVCTL, GSD, GSDCTL or SRVCONFIG.
4. the string should look like this: -DTRACING.ENABLED=true -DTRACING.LEVEL=2
In Oracle database 10g setting the below variable accomplishes the same thing, set it to
blank to remove the debugging
One of the jobs of a DBA is adding and removing nodes from a RAC environment when
capacity demands; although you should add a node of a similar spec, it is possible to add
a node of a higher or lower spec.
The first stage is to configure the operating system and make sure any necessary drivers
are installed, also make sure that the node can see the shared disks available to the
existing RAC.
I am going to presume we have a two node RAC environment already set up, and we are
going to add a third node.
Pre-Install Checking
You used the Cluster Verification utility when installing the RAC environment; the tool
checks that the node has been properly prepared for a RAC deployment. You can run the
command either from the new node or from any of the existing nodes in the cluster.
Make sure that you fix any highlighted problems before continuing.
Install CRS
Cluster Ready Services (CRS) should be installed first, this allows the node to become
part of the cluster. Adding the new node can be started from any of the existing nodes
1. Log into any of the existing nodes as user oracle then run the below command,
the script below starts the OUI GUI tool, hopefully the tool will already see the
existing cluster and will fill in the details for you
$ORA_CRS_HOME/oui/bin/addnode.sh
2. In the specify cluster nodes to add to installation screen, enter the new names
for the public, private and virtual hosts
3. Click next to see a summary page
4. Click install, the installer will copy the files from the existing node to the new
node. Once copied you will be asked to run orainstRoot.sh and root.sh as user
root
5. Run orainstRoot.sh and root.sh in the new and rootaddnode.sh in the node that
you are running the installation from.
orainstRoot.sh   sets up the Oracle inventory in the new node and sets ownership and
                 permissions on the inventory
root.sh          checks whether the Oracle CRS stack is already configured in the new
                 node, creates the /etc/oracle directory, adds the relevant OCR keys to
                 the cluster registry, adds the daemons to CRS and starts CRS in the
                 new node
rootaddnode.sh   configures the OCR registry to include the new nodes as part of the
                 cluster
7. Click next to complete the installation. Now you need to configure Oracle
Notification Services (ONS). The port can be identified by the below command
cat $ORA_CRS_HOME/opmn/conf/ons.config
8. Now run the ONS utility by supplying the <remote_port> number obtained above
Once the CRS has been installed and the new node is in the cluster, it is time to install the
Oracle DB software. Again you can use any of the existing nodes to install the software.
1. Log into any of the existing nodes as user oracle then run the below command,
the script below starts the OUI GUI tool, hopefully the tool will already see the
existing cluster and fill in the details for you
$ORACLE_HOME/oui/bin/addnode.sh
2. Click next on the welcome screen to open the specify cluster nodes to add to
installation screen, you should have a list of all the existing nodes in the cluster,
select the new node and click next
3. Check the summary page then click install to start the installation
4. The files will be copied to the new node; the installer will then ask you to run
   root.sh on the new node. Run it, then click OK to finish off the installation
Configuring the Listener
1. Login as user oracle, and set your DISPLAY environment variable, then start the
Network Configuration Assistant
$ORACLE_HOME/bin/netca
2. Choose cluster management
3. Choose listener
4. Choose add
5. Choose the name LISTENER
Run the below to create the database instance on the new node
1. Login as oracle on the new node, set the environment to database home and then
run the database creation assistant (DBCA)
$ORACLE_HOME/bin/dbca
2. In the welcome screen choose oracle real application clusters database to
create the instance and click next
3. Choose instance management and click next
4. Choose add instance and click next
5. Select RACDB (or whatever name you gave you RAC environment) as the
database and enter the SYSDBA and password, click next
6. You should see a list of existing instances, click next and on the following screen
enter ORARAC3 as the instance and choose RAC3 as the node name (substitute
any of the above names for your environment naming convention)
7. The database instance will now be created; click next in the database storage
   screen, and choose yes when asked to extend ASM
Removing a Node
1. From node 1 run the below command to stop ASM on the node to be removed
cd $ORACLE_HOME/admin
rm -rf +ASM
cd $ORACLE_HOME/dbs
rm -f *ASM*
3. Check that /etc/oratab file has no ASM entries, if so remove them
1. Login as user oracle, and set your DISPLAY environment variable, then start the
Network Configuration Assistant
$ORACLE_HOME/bin/netca
2. Choose cluster management
3. Choose listener
4. Choose Remove
5. Choose the name LISTENER
cd $ORACLE_HOME/bin
./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME
"CLUSTER_NODES={rac3}" -local
./runInstaller
2. Choose to deinstall products and select the dbhome
3. Run the following from node 1
cd $ORACLE_HOME/oui/bin
./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME
"CLUSTER_NODES={rac1,rac2,rac3}"
cd $CRS_HOME/install
./rootdelete.sh
3. Now run the following from node 1 as user root, obtain the node number first
$CRS_HOME/bin/olsnodes -n
cd $CRS_HOME/install
./rootdeletenode.sh rac3,3
4. Now run the below from the node to be removed as user oracle
cd $CRS_HOME/oui/bin
./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME
"CLUSTER_NODES={rac3}" CRS=TRUE -local
./runInstaller
5. Choose to deinstall software and remove the CRS_HOME
6. Run the following from node as user oracle
cd $CRS_HOME/oui/bin
./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME
"CLUSTER_NODES={rac1,rac2,rac3}" CRS=TRUE
7. Check that the node has been removed: the first command should report
   "invalid node", the second should return no output, and the last command
   should show only nodes rac1 and rac2
Cheatsheet
This is a quick and dirty cheatsheet on Oracle RAC 10g; as my experience with RAC
grows I will update this section. Below is a beginner's guide to the commands and
information that you will require to administer Oracle RAC.
Acronyms
GCS         Global Cache Services       in-memory database containing current locks and
                                        awaiting locks, also known as PCM
GES         Global Enqueue Services     coordinates the requests of all global enqueues,
                                        uses the GCS, also known as non-PCM
GRD         Global Resource Directory   all resources available to the cluster (formed and
                                        managed by GCS and GES), see GRD for more details
GRM         Global Resource Manager     helps to coordinate and communicate the lock
                                        requests between Oracle processes
GSD         Global Services Daemon      runs on each node with one GSD process per node.
                                        The GSD coordinates with the cluster manager to
                                        receive requests from clients such as the DBCA, EM
                                        and the SRVCTL utility to execute administrative
                                        tasks such as instance startup or shutdown. The GSD
                                        is not an Oracle instance background process and is
                                        therefore not started with the Oracle instance
PCM (IDLM)  Parallel Cache Management   formerly known as the (Integrated) Distributed
                                        Lock Manager, another name for GCS
Resource             n/a                an identifiable entity; it basically has a name or a
                                        reference. It can be an area in memory, a disk file
                                        or an abstract entity
Resource (Global)    n/a                a resource that can be accessed by all the nodes
                                        within the cluster, for example a data buffer cache
                                        block or a transaction enqueue
Useful Views/Tables
GCS and Cache Fusion Diagnostics
v$cache - contains information about every cached block in the buffer cache
v$cache_transfer - contains information from the block headers in the SGA that have been pinged at least once
v$instance_cache_transfer - contains information about the transfer of cache blocks through the interconnect
v$cr_block_server - contains statistics about CR block transfer across the instances
v$current_block_server - contains statistics about current block transfer across the instances
v$gc_element - contains one-to-one information for each global cache resource used by the buffer cache
GES Diagnostics
v$lock - contains information about locks held within a database and outstanding requests for locks and latches
v$ges_blocking_enqueue - contains information about locks that are being blocked or blocking others and locks that are known to the lock manager
v$enqueue_statistics - contains details about enqueue statistics in the instance
v$resource_limits - displays enqueue statistics
v$locked_object - contains information about DML locks acquired by different transactions in databases with their mode held
v$ges_statistics - contains miscellaneous statistics for GES
v$ges_enqueue - contains information about all locks known to the lock manager
v$ges_convert_local - contains information about all local GES operations
v$ges_convert_remote - contains information about all remote GES operations
v$ges_resource - contains information about all resources known to the lock manager
v$ges_misc - contains messaging traffic information
v$ges_traffic_controller - contains information about the message ticket usage
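A quick way to use the views above when users report hangs is to look for blockers; a minimal sketch, assuming SQL*Plus is on the PATH and OS authentication is configured:

```shell
# Sketch: report lock waits known to the lock manager (assumes OS
# authentication; can be run from any instance in the cluster)
sqlplus -s / as sysdba <<'EOF'
set linesize 132 pagesize 50
-- rows returned here are enqueues currently blocking or being blocked
select * from v$ges_blocking_enqueue;
EOF
```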
Dynamic Resource Remastering
v$hvmaster_info - contains information about current and previous master instances of GES resources in relation to the hash value ID of the resource
v$gcshvmaster_info - the same as above but for GCS resources
v$gcspfmaster_info - contains information about current and previous masters of GCS resources belonging to files mapped to a particular master, including the number of times the resource has been remastered
Cluster Interconnect
v$cluster_interconnects - contains information about interconnects that are being used for cluster communication
v$configured_interconnects - same as above but also contains interconnects that RAC is aware of that are not being used
Miscellaneous
v$service - services running on an instance
x$kjmsdp - displays LMS daemon statistics
x$kjmddp - displays LMD daemon statistics
Useful Parameters
cluster_interconnects - specifies a specific IP address to use for the interconnect
_gcs_fast_config - enables fast reconfiguration for GCS locks (true|false)
_lm_master_weight - controls which instance will hold or (re)master more resources than others
_gcs_resources - controls the number of resources an instance will master at a time
_lm_tickets - controls the number of message tickets
_lm_ticket_active_sendback - controls the number of message tickets (aggressive messaging)
_db_block_max_cr_dba - limits the number of CR copies per DBA in the buffer cache (see GRD)
_fairness_threshold - used when too many CR requests arrive for a particular buffer and the block becomes disowned (see GRD)
_gc_affinity_time - specifies the interval in minutes for remastering
_gc_affinity_limit - defines the number of times an instance accesses the resource before remastering
_gc_affinity_minimum - defines the minimum number of times an instance accesses the resource before remastering
_lm_file_affinity - disables dynamic remastering for the objects belonging to those files
_lm_dynamic_remastering - enables or disables remastering
_gc_defer_time - defines the time by which an instance defers downgrading a lock (see Cache Fusion)
_lgwr_async_broadcast - changes the SCN broadcast method (see troubleshooting)
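Of the parameters above only cluster_interconnects is a supported one; the underscore parameters should not be changed without Oracle Support's direction. A hedged sketch of pinning each instance to its private address (the IP addresses and SIDs are hypothetical):

```shell
# Sketch: bind each instance to a specific interconnect address
# (hypothetical addresses and SIDs; requires an spfile and a restart)
sqlplus -s / as sysdba <<'EOF'
alter system set cluster_interconnects='10.0.0.1' scope=spfile sid='RAC1';
alter system set cluster_interconnects='10.0.0.2' scope=spfile sid='RAC2';
EOF
```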
Processes
General Administration
Managing the Cluster
starting
  /etc/init.d/init.crs start
  crsctl start crs
stopping
  /etc/init.d/init.crs stop
  crsctl stop crs
enable/disable at boot time
  /etc/init.d/init.crs enable
  /etc/init.d/init.crs disable
  crsctl enable crs
  crsctl disable crs
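Putting the 10g R2 commands together, a clean bounce of the clusterware on one node looks roughly like this (run as root):

```shell
# Sketch: restart clusterware on the local node (10g R2, run as root)
crsctl stop crs      # stops CRS and all resources it manages on this node
crsctl start crs     # daemons and registered resources restart automatically
crsctl check crs     # confirm CSS/CRS/EVM report healthy
```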
Managing the database configuration with SRVCTL
start all instances
  srvctl start database -d <database> -o <option>
  Note: starts the listeners if not already running; use the -o option to specify startup options: force, open, mount, nomount
stop all instances
  srvctl stop database -d <database> -o <option>
  Note: the listeners are not stopped; use the -o option to specify shutdown options: immediate, abort, normal, transactional
start/stop a particular instance
  srvctl [start|stop] instance -d <database> -i <instance>,<instance>
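For example, to bounce a whole database and then a single instance (the database and instance names below are hypothetical):

```shell
# Sketch: restart a clustered database called PROD (name is hypothetical)
srvctl stop database -d PROD -o immediate   # shutdown immediate on all instances
srvctl start database -d PROD -o open       # start and open all instances
# bounce just one instance
srvctl stop instance -d PROD -i PROD1 -o immediate
srvctl start instance -d PROD -i PROD1
```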
display the registered databases
  srvctl config database
status
  srvctl status database -d <database>
  srvctl status instance -d <database> -i <instance>,<instance>
  srvctl status service -d <database>
  srvctl status nodeapps -n <node>
  srvctl status asm -n <node>
stopping/starting
  srvctl stop database -d <database>
  srvctl stop instance -d <database> -i <instance>,<instance>
  srvctl stop service -d <database> -s <service>,<service> -i <instance>,<instance>
  srvctl stop nodeapps -n <node>
  srvctl stop asm -n <node>
  srvctl start database -d <database>
  srvctl start instance -d <database> -i <instance>,<instance>
  srvctl start service -d <database> -s <service>,<service> -i <instance>,<instance>
  srvctl start nodeapps -n <node>
  srvctl start asm -n <node>
adding/removing
  srvctl add database -d <database> -o <oracle_home>
  srvctl add instance -d <database> -i <instance> -n <node>
  srvctl add service -d <database> -s <service> -r <preferred_list>
  srvctl add nodeapps -n <node> -o <oracle_home> -A <name|ip>/netmask
  srvctl add asm -n <node> -i <asm_instance> -o <oracle_home>
  srvctl remove database -d <database>
  srvctl remove instance -d <database> -i <instance>
  srvctl remove service -d <database> -s <service>
  srvctl remove nodeapps -n <node>
  srvctl remove asm -n <node>
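As a worked example, registering a two-instance database and a service from scratch might look like this (the ORACLE_HOME path, database, instance and service names are hypothetical):

```shell
# Sketch: register a database, its instances and a service in the OCR
# (hypothetical ORACLE_HOME path, database, instance and service names)
srvctl add database -d PROD -o /u01/app/oracle/product/10.2.0/db_1
srvctl add instance -d PROD -i PROD1 -n rac1
srvctl add instance -d PROD -i PROD2 -n rac2
srvctl add service -d PROD -s oltp -r PROD1,PROD2   # preferred instances
srvctl start service -d PROD -s oltp
```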
OCR utilities
log file - $ORA_HOME/log/<hostname>/client/ocrconfig_<pid>.log
checking
  ocrcheck
  Note: returns the OCR version, total space allocated, space used, free space, location of each device and the result of the integrity check
dump contents
  ocrdump -backupfile <file>
  Note: by default it dumps the contents into a file named OCRDUMP in the current directory
export/import
  ocrconfig -export <file>
  ocrconfig -import <file>
backup/restore
  # show the automatic backups
  ocrconfig -showbackup
  # change the location of the backups, you can even specify an ASM disk
  ocrconfig -backuploc <path|+asm>
  # perform a restore
  ocrconfig -restore <file>
  # delete a backup
  ocrconfig -delete <file>
  Note: there are many more options, see the ocrconfig man page
  ## add/relocate the ocrmirror file to the specified location
  ocrconfig -replace ocrmirror '/ocfs2/ocr2.dbf'
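A routine OCR health check and logical backup can be scripted; a sketch (the backup path is hypothetical):

```shell
# Sketch: OCR health check plus a logical export (hypothetical backup path)
ocrcheck                                  # integrity check, space and devices
ocrconfig -showbackup                     # list the automatic physical backups
ocrconfig -export /backup/ocr_export.dmp  # logical export, recover with -import
```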
CRS Administration
starting
  ## Starting CRS using Oracle 10g R1
  not possible
  ## Starting CRS using Oracle 10g R2
  $ORA_CRS_HOME/bin/crsctl start crs
stopping
  ## Stopping CRS using Oracle 10g R1
  srvctl stop database -d <database>
  srvctl stop asm -n <node>
  srvctl stop nodeapps -n <node>
  /etc/init.d/init.crs stop
disabling/enabling
  ## Oracle 10g R1
  /etc/init.d/init.crs [disable|enable]
  ## Oracle 10g R2
  $ORA_CRS_HOME/bin/crsctl [disable|enable] crs
checking
  $ORA_CRS_HOME/bin/crsctl check crs
  $ORA_CRS_HOME/bin/crsctl check evmd
  $ORA_CRS_HOME/bin/crsctl check cssd
  $ORA_CRS_HOME/bin/crsctl check crsd
  $ORA_CRS_HOME/bin/crsctl check install -wait 600
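The individual daemon checks above can be rolled into a short loop:

```shell
# Sketch: check each clusterware daemon in turn (10g R2)
for target in crs cssd crsd evmd; do
  $ORA_CRS_HOME/bin/crsctl check $target
done
```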
Resource Applications (CRS Utilities)
status - $ORA_CRS_HOME/bin/crs_stat
create profile - $ORA_CRS_HOME/bin/crs_profile
register/unregister an application - $ORA_CRS_HOME/bin/crs_register, $ORA_CRS_HOME/bin/crs_unregister
start/stop an application - $ORA_CRS_HOME/bin/crs_start, $ORA_CRS_HOME/bin/crs_stop
resource permissions - $ORA_CRS_HOME/bin/crs_getparam, $ORA_CRS_HOME/bin/crs_setparam
relocate a resource - $ORA_CRS_HOME/bin/crs_relocate
Nodes
member number/name - olsnodes -n
local node name - olsnodes -l
activate logging - olsnodes -g
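olsnodes output is plain "name number" pairs, so it scripts easily; a sketch that extracts the node names from captured sample output (the node names are illustrative):

```shell
# Captured sample output of `olsnodes -n` (node names are illustrative)
nodes="rac1 1
rac2 2"
# print just the node names, one per line
printf '%s\n' "$nodes" | awk '{print $1}'
# → rac1
# → rac2
```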
Oracle Interfaces
display - oifcfg getif
delete - oifcfg delif -global
set - oifcfg setif -global <interface name>/<subnet>:public
      oifcfg setif -global <interface name>/<subnet>:cluster_interconnect
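For example, to register a public and a private network globally (the interface names and subnets are hypothetical):

```shell
# Sketch: register cluster networks globally (hypothetical NICs and subnets)
oifcfg setif -global eth0/192.168.1.0:public
oifcfg setif -global eth1/10.0.0.0:cluster_interconnect
oifcfg getif   # verify both registrations
```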
Global Services Daemon Control
starting gsdctl start
stopping gsdctl stop
status gsdctl status
Cluster Configuration (clscfg is used during installation)
create a new configuration - clscfg -install
upgrade or downgrade an existing configuration - clscfg -upgrade, clscfg -downgrade
add or delete a node from the configuration - clscfg -add, clscfg -delete
create a special single-node configuration for ASM - clscfg -local
brief listing of terminology used in the other nodes - clscfg -concepts
used for tracing - clscfg -trace
help - clscfg -h
Cluster Name Check
print cluster name
  cemutlo -n
print the clusterware version
  cemutlo -w
Note: in Oracle 9i the utility was called "cemutls"
Node Scripts
Add Node - addnode.sh (see adding and deleting nodes)
Delete Node - deletenode.sh (see adding and deleting nodes)
Enqueues
displaying statistics (the column commands format the output of a query against v$resource_limit)
  SQL> column current_utilization heading current
  SQL> column max_utilization heading max_usage
  SQL> column initial_allocation heading initial
  SQL> column resource_limit format a23;
  SQL> select * from v$resource_limit;
Voting Disk
adding crsctl add css votedisk <file>
deleting crsctl delete css votedisk <file>
querying crsctl query css votedisk
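The voting disk cannot be rebuilt from the database, so it is worth taking a dd copy whenever the node membership changes; a sketch with hypothetical device and backup paths:

```shell
# Sketch: raw backup of the voting disk (hypothetical device and backup paths)
dd if=/dev/raw/raw2 of=/backup/votedisk.bak bs=4k
# restoring is the reverse dd, performed with the clusterware stopped
```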
Books
Oracle RAC Books
Oracle 10g Real Application Clusters Handbook - Although this book is lightweight compared to other Oracle books, it has enough detail to give you what you need to manage RAC; however, I did have to consult the web in order to obtain more detailed information and to clarify certain points.
Oracle 10g High Availability with RAC, Flashback and Data Guard - This book has a small section on RAC and helped clarify some of the points in the above book.