Cluster in Detail

Clustering: For Geeks... & for Normal People Too!
Copyright 2000-2006 by George Chiesa and dotNSF, Inc - ALL RIGHTS RESERVED It is kindly requested that this presentation is NOT publicly posted, see "license" slide.
George Chiesa <chiesa@dotNSF.com> Daniel Nashed <nsh@nashcom.de>
This Presentation was not researched nor conceived at the British Library

License: You have a limited license to this presentation.

Copyright 2000-2006 dotNSF and its suppliers. This presentation is non-exclusively LICENSED to you for internal usage within your own entity, company or organization. For fair-usage purposes, please quote the source as "Bubble-Bath Ideas presentation at DNUG 2006, by G. Chiesa and D. Nashed". We request this presentation NOT to be publicly reposted, please! Public abstracts will be posted at http://dotNSF.com & http://nashcom.de
This is bubble-bath-ware!
Disclaimers: NO Proofs...
Ok, just one hack from a red book where I wrote something in...
This presentation is based upon empirical info

Observed behaviours, features, bugs, beyond...
I can NOT prove many of the hypotheses here
Please accept these pearls of wisdom "as is"
Some of this information may be obsolete soon, but it's useful to know what the state of the art is
Download and get this redbook: SG24-7017 Lotus Security Handbook (2004)
Hint: firefox's "modify header" plugin extension (free)
We ALWAYS report security issues to IBM in private.

and no, we will not discuss security bugs (all fixed:-)
If you are using Reverse Proxies:

What is "Clustering for Geeks"

Clustering 101 (definitions/vocabulary)
"Clustering For Geeks" is the art of using documented functionality and "stable observed behaviours" to "automagically" provide a better and cheaper servICE (not serVER). In some cases, thinking quite outside of the box, pushing the product to the limits!
The 50/50 rule/s:

What we're covering today

60' version of a much longer workshop...
What is called "1352 Native Clustering"
Which pieces are client/server based
How each major piece works "per se"
How to make the puzzle work for you
50% of what you KNOW about clusters...
is quite useless !
50% of what you don't know about clusters
is quite useful !!!

Value Proposition 50%+50%=100%
50% of DDTs (Don't Do That!)s And 50% of DO this !
About questions...
Once upon a time... last millennium...

IT IS "OK"(not impolite)
To interrupt... to ASK questions... 'ala' easyjet... "within reason" :-)
100% of what you do not understand can, and WILL probably hurt you!
The STATE of the ART in 1995...

was THIN Ethernet (Ethernet 10, as in 10 Mb)
If you were an IBM SHOP, you had TR/4/16
Each adaptor had one and only one address
And in 1995 LOTUS was already shipping
Clustering and Failover embedded in Notes 4.01 (at the time called NPN=Notes Public Networks)
We reserve the right to postpone the answers, but, when in doubt, raise hand!
So a LOT within Notes has a strong LEGACY. So, we're going to provoke your brain to think!
Server Configured in 1995...

This is the MOST controversial!
If I were you I would use...

JUST ONE TCPIP NOTES PORT
You can still have as many addresses
You can still listen to 0.0.0.0 in notes.ini
You can still have complex tcpip routing tables
YOU DO NOT NEED THE EXTRA LOGIC

of Notes trying to cope with Ethernet 10 and just one IP address per physical card.
K.I.S.S. (at the Notes/Domino Layer!!!)
Stay awake, more controversy to come...
Listen...(Bonus HACK): ( 42 443 )

This is how I connect to my server

This time the answer is not 42 ;-) but instead: 443!
You can specify what you are "listening to"
You must understand netstat -an | find "LISTEN"
If you bind addresses you will listen on just those, BUT
you CAN specify "0.0.0.0" as a specific address!
You can use this to listen on all addresses at a port
Example: You can set a Notes server to also listen for NRPC on port 443 on 0.0.0.0
This is a useful hack when you are behind a proxy, want to access your home server, and the proxy only allows access to ports 80 and 443
On port 443, proxies use the transparent "connect method"
When visiting customers using HTTP proxies and not allowing 1352 direct: if the customer agrees to allow me to connect to my own server while at their premises... using their proxy
HACK! How does that work?

In my server's Notes.ini
PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
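To double-check what a server is configured to listen on, the port lines above can be read mechanically. The following is a minimal Python sketch, not a Domino tool: it assumes the `<PORT>_TCPIPADDRESS=0,<address>:<port>` layout shown in the notes.ini fragment, and the function name `listen_endpoints` is made up for illustration.

```python
def listen_endpoints(ini_text):
    """Return {port_name: (address, tcp_port)} for every *_TCPIPADDRESS line."""
    suffix = "_TCPIPADDRESS"
    endpoints = {}
    for raw in ini_text.splitlines():
        line = raw.strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if not key.upper().endswith(suffix):
            continue
        port_name = key[: -len(suffix)]
        # Value looks like "0,0.0.0.0:443": drop the leading "0," field,
        # then split the address from the TCP port at the last colon.
        _, _, addr_port = value.partition(",")
        address, _, tcp_port = addr_port.rpartition(":")
        endpoints[port_name] = (address, int(tcp_port))
    return endpoints


SAMPLE = """\
PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
"""

for name, (address, tcp_port) in listen_endpoints(SAMPLE).items():
    print(f"{name}: listening on {address} port {tcp_port}")
```

Comparing this list with the output of netstat -an on the server is a quick way to confirm that the second port really is bound.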
Cluster Aware "1352" Notes Clients: a.k.a. Cluster-READY clients
Definition:
A Notes Client is said to be cluster-aware when it will perform custom logic to transparently and automatically fail-over from one server to another, upon server directive or LACK of reply
QUIZ:
what % of Notes Clients are CLUSTER Aware? hint: what was the first version of Cluster Aware Notes client?
Voilà: I can connect using HTTP Proxy "transparent connect method" to 443
If I told you Notes 4.01 was the first one...
Cluster.NCF (client side)

Servers also use it to connect to other servers!
Clustering
Time=22/12/2001 14:26:46 (80256B2A:004F5AD8)
Cluster/NotesWeb
CN=Notes2/O=Notesweb
CN=Notes1/O=Notesweb
Time=03/01/2002 16:18:24 (80256B36:0059935B)
TheConifers.com
CN=dotNSF.TheConifers.com/O=TheConifers
CN=Linux.TheConifers.com/O=TheConifers
CN=WebSphere.TheConifers.com/O=TheConifers
CN=Win2k.TheConifers.com/O=TheConifers
CN=www.TheConifers.com/O=TheConifers
COMPLEX SET of design methodologies, techniques and heuristics

applied to "stuff" that you can use to "make" "n" things to be perceived as ONE
bigger/better & "more reliable"
The key words of this slide are "PERCEIVED as" NB: We're going to focus on
MultiPlatform SOFTWARE Clustering
Perspective...
Cluster Examples: 3, 5 or 20+

The "i" in RAID stands for: In-Expensive

In 1987, Patterson, Gibson and Katz at the University of California Berkeley, published "A Case for Redundant Arrays of Inexpensive Disks (RAID)" . This paper described various types of disk arrays, referred to by the acronym RAID. The basic idea of RAID was to combine multiple small, inexpensive disk drives into an array of disk drives which yields performance exceeding that of a Single Large Expensive Drive (SLED). Additionally, this array of drives appears to the computer as a single logical storage unit or drive.
Cluster.ncf: default max 2 mates TIMES 20 clusters (LKB 185700); see Cluster_Name_Cache_Size=n (notes.ini)
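To see which mates a client has cached, the cluster.ncf sample shown earlier can be read with a few lines of Python. This is only a sketch under an assumption inferred from that sample: each block starts with a `Time=` line, the next line is the cluster name, and the remaining lines are the mates. The function name `parse_cluster_ncf` is hypothetical.

```python
def parse_cluster_ncf(text):
    """Parse cluster.ncf-style text into a list of cluster dicts."""
    clusters = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Time="):
            # A Time= line opens a new cluster block.
            current = {"time": line[len("Time="):], "name": None, "mates": []}
            clusters.append(current)
        elif current is None:
            continue  # ignore anything before the first Time= line
        elif current["name"] is None:
            current["name"] = line  # first line after Time= is the cluster name
        else:
            current["mates"].append(line)
    return clusters


SAMPLE_NCF = """\
Time=22/12/2001 14:26:46 (80256B2A:004F5AD8)
Cluster/NotesWeb
CN=Notes2/O=Notesweb
CN=Notes1/O=Notesweb
Time=03/01/2002 16:18:24 (80256B36:0059935B)
TheConifers.com
CN=dotNSF.TheConifers.com/O=TheConifers
CN=Linux.TheConifers.com/O=TheConifers
CN=WebSphere.TheConifers.com/O=TheConifers
CN=Win2k.TheConifers.com/O=TheConifers
CN=www.TheConifers.com/O=TheConifers
"""

for cluster in parse_cluster_ncf(SAMPLE_NCF):
    print(cluster["name"], "->", len(cluster["mates"]), "mates")
```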
Clustering & Failover in Action
Server QUIT while reading...

Cluster Mates:
"Mate" is an industry NON-PC (non politically correct!) std term
Server Tasks involved

cladmin Servertask in R5
takes care about administrative things (D6+ not in servertasks=, launched automatically)
Definition:
A cluster of something is composed of mates, logically siblings among them (no master)
Domino-wise, a Cluster Mate can be:
Available (normal) (SAI > SAT)
Busy (Server_Availability_Index <= Server_Availability_Threshold)
Tip: You CAN BUSY a server by setting SAT=100
cldbdir takes care that cluster directory is up to date

(D6+ not in servertasks=, launched automatically)
clrepl pushes changes to other replicas based on information from cluster directory
(D6+ not in servertasks=, launched automatically) logs periodically into replication log (manual: tell clrepl log)
Unavailable (or unreachable/perceived as such)
Restricted (Temp=1 or Perm=2)
Invalid (never contacted)
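The Available/Busy rule from the Cluster Mate definition (SAI > SAT is available, SAI <= SAT is busy) can be written down in a couple of lines. This is a sketch of the slide's rule only. The function name `mate_state` is made up, and the other states (Unavailable, Restricted, Invalid) and any special handling Domino may apply to particular threshold values are deliberately not modelled.

```python
def mate_state(sai, sat):
    """Slide rule: AVAILABLE when SAI > SAT, BUSY when SAI <= SAT."""
    return "AVAILABLE" if sai > sat else "BUSY"


# An index of 100 against a low threshold is available.
print(mate_state(100, 42))
# SAT=100 busies out the server, because the index cannot exceed 100.
print(mate_state(100, 100))
print(mate_state(65, 100))
```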
replica should still be active as a fallback and to init replicas!
Servers regularly check the state of their Cluster Mates

Portfolio techiques / Sizing heuristics

API Level call NSPingServer

gives back a list of cluster mates and the availability
You can check this information via

> show cluster
Cluster Information
Cluster name: nsh-cluster, Server name: nsh-dus-02/Srv/NashCom/DE
Server cluster probe timeout: 1 minute(s)
Server cluster probe count: 185
Server availability threshold: 0
Server availability index: 100 (state: AVAILABLE)
Cluster members (2)...
server: nsh-dus-02/Srv/NashCom/DE, availability index: 100
server: nsh-dus-01/Srv/NashCom/DE, availability: 42
There are always 2 practical limits:

Lower: at LEAST how many you need to reduce risk
Upper: at MOST how many you can manage effectively
Tip: Start with 3 or 4, fine tune afterwards, but please
do NOT start with 2 or 6
Class of service: by "n" instances of resource

Almost Real Time Replication...

Say, for the purpose of example, you have "3"

"whatevers": OSs, Sites, Servers, Routers, ISPs
Say you name the 3 elements A, B and C
a) we need to define how we will synchronize

Bad News:
Scheduled replication not good enough... Some apps must be cluster aware enabled!
With 3 elements you can define the following

Classes of Service:
Top, simultaneously present in A+B+C
Middle, present in either: AB, AC or BC
Single, present just in A or B or C
Good News:
NATIVE Event/Queue Driven = CLREPL = (aka Almost Real Time) Most apps will automatically work better
Homework: Try the combinations for 4 units,

C(4,4) + C(4,3) + C(4,2) + C(4,1)
b) we still need to spread the load/access.
Nota benissimo: DO STOP AT 4 ! ! !
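The class-of-service counting can be checked by enumeration. The Python sketch below lists every placement for the elements A, B and C and then evaluates the homework sum C(4,4) + C(4,3) + C(4,2) + C(4,1). The function name `classes_of_service` is illustrative only.

```python
from itertools import combinations
from math import comb


def classes_of_service(elements):
    """Map 'number of elements spanned' to the list of possible placements."""
    return {
        size: ["".join(combo) for combo in combinations(elements, size)]
        for size in range(len(elements), 0, -1)
    }


three = classes_of_service("ABC")
for size, placements in three.items():
    print(size, placements)

# The homework: how many placements exist with 4 units?
total_four = sum(comb(4, k) for k in range(1, 5))
print("placements with 4 units:", total_four)
```

With 3 elements there are 1 + 3 + 3 = 7 placements; with 4 there are 15. The count roughly doubles with every extra unit, which is one way to read the advice to stop at 4.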
ClDbDir
ClDbDir (contents)
It's a Notes Database, similar to catalogue,

Cluster Specific (RepId depends on ClusterName)
Maintained by a server task of the same name
It's in the Enterprise Edition of Domino
Contains info about databases deployed in a cluster
Is used by Notes/Domino Cluster Aware modules
to know where to push what (and what NOT to!!!) and for "failovers": a server finds the resource elsewhere!
Like CATALOG, each server updates its OWN dbs
BEWARE: 8192 is the maximum number of useful entries; you do NOT get a warning NOR an error message!
Bonus Hack: Set Config Cluster_Admin_On=1 It also works IN NON Clustered servers!
You can afterwards do:

CL DEL filename (cluster delete)
CL COPY source dest REPLICA
CL OUT database (out of service)
CL IN database (in service again)
both work but are only meaningful in clusters

Useful to OUT-of-service databases

BEFORE adding an OLD server to a cluster
Useful for decommissioning an old server
You HAVE to add a server to get it into the CLIENT's Cluster.NCF

From LKB: How Push-Pull (std) Replica works

[Diagram: an UPDATE arrives at each SERVER; the REPLICA task pushes and pulls changes between the two DATABASEs (DATA and VIEW)]
From LKB: How Push Cluster Replica works!

[Diagram: an UPDATE arrives at each SERVER; CLREPL (replica) pushes changes between the two DATABASEs (DATA and VIEW)]

How does Cluster Replication work (details)

Document changes are captured and trigger the Cluster Replicator via a message queue
The Cluster Replicator reads the message queue and pushes changes to all other replicas in the cluster, regardless of replication settings (aka almost "real time" replication)
CLREPL
ClRepl (cont'd)
CLREPL is a server task
It's an in-memory QUEUE-driven event replicator (REMEMBER BATH TUB!) that SHOULD push content
at most within 15 seconds, on average 7
ClRepl PUSHES content modified locally to all cluster mates containing replicas of the modified database
Tips:
It PUSHES ignoring source ACL
Check that the queue is not overfilled
Always schedule CLASS+1 of them
NB: CLREPL does NOT initialize "Replica Stubs"
It also knows what to push and what NOT to: Out Of Service (for quite obvious reasons) but also Pending Delete (cldbdir does the final push, not clrepl!)
thus ClRepl is also sometime called RTR or "ALMOST" REAL TIME REPLICATOR
the KEY here is in "ALMOST"
ClRepl (cont'd)
Cluster Replicator Performance & Statistics

General Rule: number of clrepl = cluster members "minus" 1
R5: servertasks=events4, repl, router, clrepl, clrepl, clrepl, ...
D6: Cluster_Replicators=n
My Tip: set it to CLASS_OF_SERVICE PLUS one, not minus one. Overschedule it and it's cheap; underschedule it and you will have problems!
ClRepl will keep an IN-memory queue
It's a QUEUE, and can be overfilled
It's in MEMORY and is NOT disk persistent
THUS, also schedule normal replicas
Tips:
Within reason, overscheduling pull replicas is not a huge issue, because the deltas are small
i.e. Enabled Replica From */Srv/Whatever to <each>/Srv/Whatever, PULL, every 60 Mins
Will make servers catch up fast, pulling at restart time.
Check if clustering works properly via Show Stat Replica.Cluster.*

Replica.Cluster.WorkQueueDepth should be "small", i.e. less than 10
Replica.Cluster.Retry.Waiting should also be "small", i.e. less than 5
Replica.Cluster.Failed should be zero if possible (easy to say :-)
Check the Max and Average times in queue; they should be < 10 seconds
TIP: SH ST REPLICA.CLUSTER.*Q*
(Daniel to explain detail stats)
Show Stat Server.Cluster.*

Server.Cluster.OpenRedirects.xxx.Unsuccessful = 0 check for unsuccessful redirects!
How to restrict access (LKB 7002910)

SAI examples, un/touched

Domino server clusters have an optional workload balancing feature that lets you distribute the workload of heavily-used databases across multiple servers in a cluster. To distribute workload, you limit or restrict the work that a server can perform using the following settings in the NOTES.INI: Server_Availability_Threshold
This setting allows you to specify the maximum availability level beyond which the server attempts to redirect user requests to other servers in the cluster. A server's availability index is recalculated each minute and compared against any threshold you set. If the index falls below the server threshold, the server becomes BUSY. The Cluster Manager redirects access requests from a BUSY server to the servers in the cluster. When an attempt to redirect is unsuccessful, the user receives access to the BUSY server. Each time a redirection occurs, Notes generates a workload balancing event in the Notes log (LOG.NSF).
Server_MaxUsers
This setting specifies the maximum number of user sessions allowed on a server. When the server reaches this limit, the server goes into a MAXUSERS state. The Cluster Manager then attempts to redirect new user request to other servers in the cluster. To see how often requests are being redirected, check the LOG.NSF for failover events. If redirection of the user request is unsuccessful, the user receives a message, and is not allowed access to the server.
You may want to smooth this (or not)
Server_Restricted
This setting enables a server to deny new open database requests and places the server in a RESTRICTED state. Users who have active connections to databases retain their connections. The Cluster Manager attempts to redirect new requests to other servers in the cluster. When an attempt to redirect is unsuccessful, the user receives a message and is not allowed access to the server. For each redirection attempt, Notes generates a failover event in the LOG.NSF. Note: You can use the Server_Restricted setting for any Domino server. This setting is not restricted to clusters.
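The three settings differ mainly in what happens to the user when redirection fails. The Python sketch below encodes only what the LKB text above states. The function and constant names are made up, and "AVAILABLE" stands for a server in none of the three states.

```python
# Outcome for the user when the Cluster Manager cannot redirect the request,
# taken from the descriptions of the three notes.ini settings.
FAILED_REDIRECT_OUTCOME = {
    "BUSY": "user is given access to the BUSY server",
    "MAXUSERS": "user receives a message and is not allowed access",
    "RESTRICTED": "user receives a message and is not allowed access",
}


def open_request_outcome(state, redirect_succeeded):
    """Describe what happens to a new open-database request."""
    if state == "AVAILABLE":
        return "served locally"
    if state not in FAILED_REDIRECT_OUTCOME:
        raise ValueError(f"unknown server state: {state}")
    if redirect_succeeded:
        return "redirected to another cluster server"
    return FAILED_REDIRECT_OUTCOME[state]


for state in ("BUSY", "MAXUSERS", "RESTRICTED"):
    print(state, "->", open_request_outcome(state, redirect_succeeded=False))
```

The practical consequence is that a BUSY server still serves users when no mate can take them, while MAXUSERS and RESTRICTED turn them away.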
Best Practices for Cluster Replication

Cluster Replication & Database Quotas

There are issues with Database Quotas before R5.0.10 Good news:
New option in R5.0.10: CLREPL_OVERRIDE_QUOTAS=1
Domino 6 overrides quotas by default
You get the old behavior with Clrepl_Obeys_Quotas=1 (DDT)
Ensure you have full Manager access for LocalDomainServers as a Server group or, better, */Srv/Org as Manager of type Server in all ACLs. I prefer hardcoding OUs to groups. Works always!
Make sure all applications provide roles to give access to documents with reader fields (remember computed auth fields)
Give Servers all rights and roles to "see" all documents
Don't use replication formulas for clustered databases
Have a scheduled replication in case some events in the clrepl queue get lost or the server is down...
Add startup replication documents "from *" to ensure databases are up to date after server restart
Schedule replication to the name of the cluster instead of single server names (load balancing & failover)
Bad news:
If you already have this problem, you need to delete the replication history and the CutOff Date to resolve existing replication problems
LotusScript can clear the replication history:
Set rep = db.ReplicationInfo
Call rep.ClearHistory()
Call rep.Save()
But it cannot remove the CutOffDate (in most cases not needed)
Notes Named Network & Directory Assistance

Changes/Recommendations
Customer was using Notes Named Networks (NNN) across WAN connection
Caused unintended traffic
Only local servers in the same NNN
Use only local directories in (DA)
Used "*" to specify the local replica only (TN #1087708)
Evaluating Extended Directory Catalog for further optimization
Directory catalog could simplify working with external addresses and allow more flexibility
Directory Assistance (DA)

Multiple replicas of 4 Directories were used
First Server in the list was a remote server in the same NNN in some cases!
Changed configuration to use the local server only
All servers had replicas of all directories
One external directory had a huge number of deletion stubs due to the external company always reimporting the directory :-(
Avoid large number of changes in Domino directories

Less need to update views in Domino Directory
Fewer deletion stubs
Not the first time we have seen nightly complete delete/add import agents in customer environments
How to use NNNs (KISS)

Other High Availability Tips

Domino 6/7 support multiple versions on one logical UNIX/Linux box
much easier update and coexistence of multiple releases; allows an easy-to-handle "go back" scenario
One for TCPIP (and one per Cluster Port )
Fault-Recovery
Maximize server availability
Faster server restart after crash!
Automatically collect NSDs for faster troubleshooting
Domino Server Availability Index (SAI or AI)

LoadMon
Domino 6+ uses a new algorithm to calculate the workload of a server and the resulting AI
A number of customers reported unpredictable, alternating AI which caused Clustering to fail.
Algorithm was enhanced in D6.0.2CF2 and additional notes.ini parameters have been introduced.
But there is another bug that is hopefully finally fixed in D6.5.6 and D7.0.2!
We traced AI at customer site
Live Environment
Test Environment with Server.Load
Domino 6/7 use a module called "LoadMon"

Routine calculating speed of current transactions, summarizes and compares them with previous intervals and minimum values (RunningAvgTime & MinAvgTransTime) unit: microseconds
OPEN_DB OPEN_NOTE CLOSE_DB DB_INFO_GET DB_REPLINFO_GET GET_OBJECT_SIZE READ_OBJECT GET_SPECIAL_NOTE_ID DB_READ_HIST DB_WRITE_HIST SERVER_AVAILABLE_LITE NIF_OPEN_NOTE
Expansion Factor (XF)

LoadMon Notes.ini Settings

SERVER_TRANSINFO_MAX (default 5 / max 60): number of statistics collections stored in LoadMon
SERVER_TRANSINFO_UPDATE_INTERVAL (default 15): interval for statistics capturing & calculation
SERVER_MIN_TRANS (default 5): minimum transactions needed for a statistic value to be valid
SERVER_TRANSINFO_NORMALIZE (default 3000)
SERVER_TRANSINFO_HTTP_NORMALIZE (12000)
as far as we found out, used to initialize empty statistics (zero in loadmon.ncf) on startup in Domino 6
XF is calculated based on the performance values of current transactions in relation to minimum time for a transaction
It's the number of times the current transactions take longer than the minimum transaction time
XF values for different transactions build an overall XF
This XF is computed and converted into AI based on a Range to scale the XF (TN #1112352)
Notes.ini Server_Transinfo_Range=n: n is 6 by default and specifies the maximum Expansion Factor of a Domino Server.
The maximum XF is calculated as 2 raised to the power n (64 by default)
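As a worked illustration of the two quantities described here, the Python sketch below computes a per-transaction XF (current average time divided by minimum average time) and the maximum XF for a given range (2 to the power n). The sample numbers are the OPEN_DB and OPEN_NOTE values from the DEBUG_LOADMON statistics in this deck. How Domino combines the per-transaction values into the overall XF and then into AI is not described in the slides and is not attempted. The function names are hypothetical.

```python
def expansion_factor(running_avg_time, min_avg_time):
    """How many times slower the current transactions are than the best observed."""
    return running_avg_time / min_avg_time


def max_expansion_factor(transinfo_range=6):
    """Maximum XF for a given Server_Transinfo_Range: 2 raised to the power n."""
    return 2 ** transinfo_range


# RunningAvgTime / MinAvgTransTime pairs taken from the sample statistics.
samples = {
    "OPEN_DB": (4143, 429.166666666667),
    "OPEN_NOTE": (738, 272.987714987715),
}

for name, (running, minimum) in samples.items():
    print(f"{name}: XF = {expansion_factor(running, minimum):.2f}")

print("max XF with default range 6:", max_expansion_factor())
print("max XF with range 15:", max_expansion_factor(15))
```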
Debugging LoadMon
se co DEBUG_LOADMON=1
Server.LoadMon.TransInfo.AI.Type = 0
Server.LoadMon.TransInfo.CurrentTransCount.CLOSE_DB = 3
Server.LoadMon.TransInfo.CurrentTransCount.DB_INFO_GET = 2
Server.LoadMon.TransInfo.CurrentTransCount.DB_READ_HIST = 0
Server.LoadMon.TransInfo.CurrentTransCount.DB_REPLINFO_GET = 5
Server.LoadMon.TransInfo.CurrentTransCount.DB_WRITE_HIST = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_NOTE_INFO = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_OBJECT_SIZE = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_SPECIAL_NOTE_ID = 0
Server.LoadMon.TransInfo.CurrentTransCount.NIF_OPEN_NOTE = 0
Server.LoadMon.TransInfo.CurrentTransCount.OPEN_DB = 3
Server.LoadMon.TransInfo.CurrentTransCount.OPEN_NOTE = 7
Server.LoadMon.TransInfo.CurrentTransCount.READ_OBJECT = 0
Server.LoadMon.TransInfo.CurrentTransCount.SERVER_AVAILABLE_LITE = 2
Server.LoadMon.TransInfo.HttpNormalize = 12000
Server.LoadMon.TransInfo.IntervalInSeconds = 15
Server.LoadMon.TransInfo.Max = 5
Server.LoadMon.TransInfo.MinAvgTransTime.CLOSE_DB = 58.1818181818182
Server.LoadMon.TransInfo.MinAvgTransTime.DB_INFO_GET = 119.875
Server.LoadMon.TransInfo.MinAvgTransTime.DB_READ_HIST = 210.666666666667
Server.LoadMon.TransInfo.MinAvgTransTime.DB_REPLINFO_GET = 88.5714285714286
Server.LoadMon.TransInfo.MinAvgTransTime.DB_WRITE_HIST = 240.2
Server.LoadMon.TransInfo.MinAvgTransTime.GET_NOTE_INFO = 110.235087719298
Server.LoadMon.TransInfo.MinAvgTransTime.GET_OBJECT_SIZE = 141.777777777778
Server.LoadMon.TransInfo.MinAvgTransTime.GET_SPECIAL_NOTE_ID = 93.333333333
Server.LoadMon.TransInfo.MinAvgTransTime.NIF_OPEN_NOTE = 1,031.4285714286
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_DB = 429.166666666667
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_NOTE = 272.987714987715
Server.LoadMon.TransInfo.MinAvgTransTime.READ_OBJECT = 134.285714285714
Server.LoadMon.TransInfo.MinAvgTransTime.SERVER_AVAILABLE_LITE = 95.3333333
Server.LoadMon.TransInfo.MinTrans = 5
Server.LoadMon.TransInfo.Normalize = 3000
Server.LoadMon.TransInfo.Range = 15
Server.LoadMon.TransInfo.RunningAvgTime.CLOSE_DB = 214.333333333333
Server.LoadMon.TransInfo.RunningAvgTime.DB_INFO_GET = 172
Server.LoadMon.TransInfo.RunningAvgTime.DB_READ_HIST = 0
Server.LoadMon.TransInfo.RunningAvgTime.DB_REPLINFO_GET = 187
Server.LoadMon.TransInfo.RunningAvgTime.DB_WRITE_HIST = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_NOTE_INFO = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_OBJECT_SIZE = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_SPECIAL_NOTE_ID = 0
Server.LoadMon.TransInfo.RunningAvgTime.NIF_OPEN_NOTE = 0
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_DB = 4,143
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_NOTE = 738
Server.LoadMon.TransInfo.RunningAvgTime.READ_OBJECT = 0
Server.LoadMon.TransInfo.RunningAvgTime.SERVER_AVAILABLE_LITE = 104
46 statistics found
debug_loadmon=1
Enables LoadMon Debugging, writes additional information to server console
07.10.2003 07:08:09 Loadmon: Domino AI = 100, XF = 1
And adds additional 46 statistics counters (server.loadmon.*) Can be captured locally or remotely via "show server" or statistics collection program. nstats servername or C-API NSFGetServerStats (...)
loadmon.ncf
loadmon.ncf in the Domino data directory stores the last information from LoadMon before the server is shut down
It is loaded on server start to initialize statistics counters
BEWARE LARGE OVERFLOW INTO NEGATIVE VALUES
Quit, delete loadmon.ncf, restart server (do after upgrades!)
What did we find out?

AI in D6.0.1 without Optimizing of Loadmon
Domino 6.0.1 AIX 5.1 dropping AI
Listen...(HACK 2)
You need to understand which fields are
Listens (usually in specific tabs)
HostNames that are NOT Listens
For example:
you can tell Domino that its HTTP hostname is the name of something else, even on a different machine
URLs will be created nicely
AI with default interval 15 sec and 5 sampling values does not always result in steady AI
we needed to find values which provide steady values, so that cluster failover does not occur "randomly" or cause Ping-Pong effects
and a reasonable time to reflect the current workload in AI
Standard interval and sampling: 15*5 cover 75 seconds
Interval 10 seconds with 20 sampling values: cover 200 seconds
Standard Server.Load scripts do not help much because most transactions are not used in standard scripts
HACK 3: How to use clustering for server consolidation

Add ALL servers to ONE CLUSTER...
Make sure you have DBs no more than 3 times
SET the SAT of the OLD servers to 100
This will BUSY them out
Users will LOADBALANCE to new servers (for all NON ADMIN/Manager users)
Unless you forgot an app that lives just on the old servers, because it will continue to access the old servers
For example, to get this

Cluster name: DOMPMAC01, Server name: DOMAGP01/SRV/Customer
Server cluster probe timeout: 1 minute(s)
Server cluster probe count: 47191
Server cluster default port: *
Server availability threshold: 100
Server availability index: 0 (state: BUSY)
Server availability default minimum transaction time: 3000
Cluster members (11):
Server: DOMPMA01/SRV/Customer, availability index: 81
Server: DOMPMA02/SRV/Customer, availability index: 78
Server: DOMPIN02/SRV/Customer, availability index: 65
Server: DOMPIN01/SRV/Customer, availability index: 63
Server: DOMMYP01/OLD/SRV/Customer, availability index: 0
Server: DOMMYP02/OLD/SRV/Customer, availability index: 0
server: DOMOEP01/SRV/Customer, availability: BUSY
server: DOMHEP01/SRV/Customer, availability: BUSY
server: DOMCVP01/SRV/Customer, availability: BUSY
server: DOMVGP01/SRV/Customer, availability: BUSY
server: DOMAGP01/SRV/Customer, availability: BUSY
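With eleven members, reading this output by eye gets tedious. The Python sketch below pulls the member lines out of captured "show cluster" text. It assumes only the two line shapes visible in the console samples in this deck ("availability index: N" and "availability: VALUE"), and the function name `cluster_members` is made up.

```python
import re

# Matches "server: NAME, availability index: 81" and "server: NAME, availability: BUSY".
MEMBER_RE = re.compile(
    r"^\s*server:\s*(?P<name>[^,]+),\s*availability(?: index)?:\s*(?P<value>\S+)",
    re.IGNORECASE,
)


def cluster_members(console_text):
    """Return {server_name: availability}, as int when numeric, else the raw text."""
    members = {}
    for line in console_text.splitlines():
        match = MEMBER_RE.match(line)
        if match:
            value = match.group("value")
            members[match.group("name").strip()] = int(value) if value.isdigit() else value
    return members


SAMPLE_OUTPUT = """\
Cluster name: DOMPMAC01, Server name: DOMAGP01/SRV/Customer
Server availability threshold: 100
Server availability index: 0 (state: BUSY)
Cluster members (11):
Server: DOMPMA01/SRV/Customer, availability index: 81
Server: DOMMYP01/OLD/SRV/Customer, availability index: 0
server: DOMOEP01/SRV/Customer, availability: BUSY
"""

for name, availability in cluster_members(SAMPLE_OUTPUT).items():
    print(f"{name}: {availability}")
```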
Fine tuning via SAI/SAT/Range

And when you turn off a server...

Cluster information:
Cluster name: DOMPMAC01, Server name: DOMMYP01/SRV/Customer
Server cluster probe timeout: 1 minute(s)
Server cluster probe count: 62831
Server cluster default port: *
Server availability threshold: 0
Server availability index: 28 (state: AVAILABLE)
Server availability default minimum transaction time: 3000
Cluster members (11):
Server: DOMPMA02/SRV/Customer, availability index: 79 <- SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPMA01/SRV/Customer, availability index: 78 <- SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPIN01/SRV/Customer, availability index: 64 <- SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPIN02/SRV/Customer, availability index: 39 <- SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMMYP02/OLD/SRV/Customer, availability index: 0 <- SERVER_TRANSINFO_RANGE=2 & SAT=0
Server: DOMMYP01/OLDSRV/Customer, availability index: 0 <- SERVER_TRANSINFO_RANGE=2 & SAT=0
server: DOMHEP01/SRV/Customer, availability: BUSY <- SERVER_AVAILABILITY_THRESHOLD=100
server: DOMVGP01/SRV/Customer, availability: BUSY <- SERVER_AVAILABILITY_THRESHOLD=100
server: DOMCVP01/SRV/Customer, availability: BUSY <- SERVER_AVAILABILITY_THRESHOLD=100
server: DOMAGP01/SRV/Customer, availability: BUSY <- SERVER_AVAILABILITY_THRESHOLD=100
server: DOMOEP01/SRV/Customer, availability: BUSY <- SERVER_AVAILABILITY_THRESHOLD=100
Remember to ignore the probes failures

if annoyed increase the period of the probe Server_Cluster_Probe_Timeout=1 (minute)
Dead servers do not run cldbdir, thus (hack!)

In the new servers' CLDBDIR, manually DELETE ALL instances of DBs on the old servers
Failover by replicaID uses the new servers!
CLREPL will NOT attempt to keep dead servers updated (EXTREMELY IMPORTANT!!!!!!!!)
You can keep old dead servers

To finally delete the server

In the cluster for reasonable long time BUT you must check the logs and
sh st replica.cluster.*q*
You can't have lost transactions, because CLDBDIR thinks the old servers are EMPTY but alive
CL Manager will say once a minute that they are unreachable, which is what you want for AUTOMATIC user failover... over time...
use AdminP !!!
Other Caveats/Tips/Tricks:
Always TEST failovers

You must make sure you edit the old servers' records in the NAB to remove mail routing
You do not want mail to be routed via old dead servers
You'd better do a server decommission report
BEFORE turning them off... a machine turned off produces no reports
DO NOT remove the old server from the cluster yet
with a TEST user ID that is

NON Administrator NON manager of apps databases
It is assumed that managers know

where they want to access dbs and will NOT attempt failover
if you test with ADMIN.id: will drive you MAD
Cluster Analysis
Failover
Cluster Analysis is a great feature to figure out about problems in your cluster
It's part of the Admin Client (Server / Analysis / Cluster ...)
Run it to find problems with ACL, Replication, non-existent databases, ...
Definition:
Server Initiated: due to reactive Load Balance or failures
Client Initiated: server is dead or perceived as dead
requires the client to know how to connect to cluster mates without server assistance!
Tips: insert the address in name:
CN=<FullyQualifiedDomainName>/Whatever
CN=194.196.39.11/Srv/LotusEmea/Net
Tips
Run it, print it and sign off all warnings you find
Use FT Search to remove multiple occurrences of similar or already fixed problems until the DB is empty

Run Analysis again to check that you addressed all problems
DEBUG_NOSTDOUT=1
DO NOT USE THIS PLEASE

If you leave debug parameters ON in prod

capture the debug in files debug_Outfile= and NOT in StdOut
DEBUG_RUN_AS_ROOT=1
it WILL allow you to run as root in UNIX/Linux
it will NOT allow you later to run as non-root unless you fix all the owners, permissions, etc. of everything it created (just DDT please!)
Exception: some custom restores required root
GET A NEW VERSION OF RESTORE TOOL
for performance reasons and also... for sanity of old 3rd party apps (&BACKUPs)
Replication Debugging
DEBUG_REPL=2 & DEBUG_REPL_ALL=1
sh st replica.cluster.*
(if you do not read the stats, why bother clustering?)
Log_Replication: (not ORable, different values, -1 does not work!)

Log_Replication=0....No replication logging
Log_Replication=1....Logs server replication events
Log_Replication=2....Adds logging of replication activity at the database level
Log_Replication=3....Adds logging of replication activity at the note level
Log_Replication=4....Adds logging of replication activity at the field level
Log_Replication=5....Adds summary logging
RTR_Logging: (Tip: You can OR (sum) these, i.e. 63 is a LOT!)

RTR_Logging= 1....Default level of logging (major routines, events, etc.)
RTR_Logging= 2....Log all context structure changes
RTR_Logging= 4....Log replications: attempted & performed
RTR_Logging= 8....Log iterations through main polling loop
RTR_Logging=16...Verbose debug logging
RTR_Logging=32...Log all lock operations
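Because the RTR_Logging values are powers of two, they can be summed (ORed) into a single setting. The Python sketch below shows the arithmetic behind "63 is a LOT". The member names are invented labels for the six documented values, not Domino identifiers.

```python
from enum import IntFlag


class RtrLogging(IntFlag):
    DEFAULT = 1           # major routines, events, etc.
    CONTEXT_CHANGES = 2   # all context structure changes
    REPLICATIONS = 4      # replications attempted & performed
    POLLING_LOOP = 8      # iterations through main polling loop
    VERBOSE_DEBUG = 16    # verbose debug logging
    LOCK_OPERATIONS = 32  # all lock operations


# Pick the levels you want and OR them together.
chosen = RtrLogging.DEFAULT | RtrLogging.REPLICATIONS
print(f"RTR_Logging={int(chosen)}")

# Everything switched on: the sum of all six values.
everything = sum(flag.value for flag in RtrLogging)
print(f"RTR_Logging={everything}")
```

Log_Replication is different: its levels are cumulative steps from 0 to 5, so those values are never added together.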
Replica.Cluster.Docs.Added = 26790
Replica.Cluster.Docs.Deleted = 16060
Replica.Cluster.Docs.Updated = 378378
Replica.Cluster.Failed = 30
Replica.Cluster.Files.Local = 83
Replica.Cluster.Files.Remote = 83
Replica.Cluster.Retry.Skipped = 222
Replica.Cluster.Retry.Waiting = 0
Replica.Cluster.SecondsOnQueue = 13
Replica.Cluster.SecondsOnQueue.Avg = 2
Replica.Cluster.SecondsOnQueue.Max = 3593
Replica.Cluster.Servers = 1
Replica.Cluster.SessionBytes.In = 160450213
Replica.Cluster.SessionBytes.Out = 824894460
Replica.Cluster.Successful = 13484
Replica.Cluster.WorkQueueDepth = 0
Replica.Cluster.WorkQueueDepth.Avg = 0
Replica.Cluster.WorkQueueDepth.Max = 4
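Reading these statistics is easy to automate. The Python sketch below applies the rules of thumb given earlier in this deck (work queue depth below 10, retries waiting below 5, failures at zero, queue times under 10 seconds) to captured "sh st replica.cluster.*" text. Mapping "Max and Average times in queue" to the SecondsOnQueue.Max and SecondsOnQueue.Avg statistics is an assumption, and the names `parse_stats`, `LIMITS` and `cluster_warnings` are made up.

```python
def parse_stats(text):
    """Turn 'Name = value' lines into {name: float}."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            # Console output may use thousands separators such as 4,143.
            stats[name.strip()] = float(value.strip().replace(",", ""))
    return stats


# A statistic at or above its limit is reported.
LIMITS = {
    "Replica.Cluster.WorkQueueDepth": 10,
    "Replica.Cluster.Retry.Waiting": 5,
    "Replica.Cluster.SecondsOnQueue.Avg": 10,
    "Replica.Cluster.SecondsOnQueue.Max": 10,
}


def cluster_warnings(stats):
    """Return the names of the statistics that break a rule of thumb."""
    problems = [name for name, limit in LIMITS.items() if stats.get(name, 0) >= limit]
    if stats.get("Replica.Cluster.Failed", 0) > 0:
        problems.append("Replica.Cluster.Failed")
    return problems


SAMPLE_STATS = """\
Replica.Cluster.Failed = 30
Replica.Cluster.Retry.Waiting = 0
Replica.Cluster.SecondsOnQueue.Avg = 2
Replica.Cluster.SecondsOnQueue.Max = 3593
Replica.Cluster.WorkQueueDepth = 0
"""

for name in cluster_warnings(parse_stats(SAMPLE_STATS)):
    print("check:", name)
```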
Network_Sprayer_Address=*
Failover by Path
Useful to disable name checking after connect
I just wish it worked better (it does not always work)
DO_NOT_USE_REMEMBERED_ADDRESSES=1
Normally, you should NOT get it
What you should get are mostly failovers by RepId
It is a sign that you have multiple instances of the same replica ID on one server
You should (almost) never have duplicates
SH DIR on the server tells you the duplicates
Requested to be added to the ADMIN client next
Server_TransInfo_Normalize
Server_TransInfo_Range
default = 3000
Unit is milliseconds * 100 of a std transaction
3000 is a BAAAAAAAAD default
Fortunately Loadmon.ncf helps
to save old real times for all transactions
If you don't know better,

set between 10 and 40 default is 6 and is WAAAAAAAAY TOOO LOW
Allegedly (rumour) it also helps NON clustered HTTP servers
Apparently some code in http checks SAI for self tuning, and a better SAI uses HW better
USE: AvailabilityIndexType=1 (for nonHTTP)
Tell CLREPL pause/resume

Useful to be able to read something if you are using a very high debug level
Remember to resume it, else you will go nuts trying to figure out what happened.

Show AI (in AIx but is different)

What should be seen is this:
> show ai

nconsole DOMPHU00 "sh ai"
Range    XF   Hits  Min AI  Max AI
    1     2  48406      93     100
    2     4   1380      77      93
    3     8   1226      64      77
    4    16    821      51      64
    5    32    106      38      51
    6    64     39      26      37
    7   128     16      20      25
...
Current value of SERVER_TRANSINFO_RANGE is 6. <<changes suggested for SERVER_TRANSINFO_RANGE>>

nconsole DOMPHU01 "sh ai"
Range    XF   Hits  Min AI  Max AI
    1     2  48826      93     100
    2     4   1052      77      93
    3     8   1148      64      77
    4    16    711      51      64
    5    32    197      38      51
    6    64     40      27      38
    7   128      0
    8   256      4       1       5
    9   512     13       0       0
   10  1024     11       0       0
   11  2048      1       0       0
   12  4096      1       0       0
Q&A

SUPPORT "EXTRA" MATERIAL

These are the support pages... Which you can get by asking for them at the back of your business card... We politely request NO REPOSTING...
Mission Critical Service

Much better defined by the Total Cost of NOT HAVING IT when you need it
In other words, something that, despite having a (well known?) TCO, may prove significantly more painful & expensive "NOT TO HAVE"

The "Nines":
2 nines (99%) =circa= 88 hours/year
3 nines (99.9%) =circa= 9 hours/year
4 nines (99.99%) =circa= 52 minutes/year
5 nines (99.999%) =circa= 5 minutes/year
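The "nines" figures follow directly from the number of hours in a year. A small sketch of the arithmetic, assuming a 365-day year and downtime spread over all 24 hours:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(nines):
    """Allowed downtime per year for an availability of N nines."""
    unavailability = 10 ** (-nines)  # 2 nines -> 0.01, 3 nines -> 0.001, ...
    return HOURS_PER_YEAR * unavailability


for n in range(2, 6):
    hours = downtime_hours_per_year(n)
    if hours >= 1:
        print(f"{n} nines: about {hours:.1f} hours/year")
    else:
        print(f"{n} nines: about {hours * 60:.1f} minutes/year")
```

The results (87.6 hours, 8.8 hours, 52.6 minutes, 5.3 minutes) agree with the rounded values on the slide.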
Downtime costs per user = [ (Total hours of unscheduled downtime X 25% of user population X Hourly user salary) + (Total hours of scheduled downtime X Hourly Messaging Administrator salary) ] / Number of messaging users
NOTA BENE: R.S.E. and Change Management/Control needs
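The downtime cost formula can be sketched as a small function. The slide's bracketing is ambiguous, so this reading is an assumption: unscheduled downtime is taken to hit 25% of the user population at the hourly user salary, and scheduled downtime to cost administrator time only. All input figures in the example are made up for illustration.

```python
def downtime_cost_per_user(unscheduled_hours, scheduled_hours, users,
                           user_hourly_salary, admin_hourly_salary,
                           affected_share=0.25):
    """Downtime cost per messaging user, under the reading stated above."""
    unscheduled_cost = (unscheduled_hours * affected_share * users
                        * user_hourly_salary)
    scheduled_cost = scheduled_hours * admin_hourly_salary
    return (unscheduled_cost + scheduled_cost) / users


# Hypothetical inputs: 9 h unscheduled ("3 nines"), 40 h scheduled,
# 1000 users, salaries of 30 and 50 per hour
cost = downtime_cost_per_user(
    unscheduled_hours=9, scheduled_hours=40, users=1000,
    user_hourly_salary=30, admin_hourly_salary=50)
print(f"Downtime cost per user: {cost:.2f}")
```

With these inputs the unscheduled term dominates, which matches the slides' point that unplanned downtime is what the business actually pays for.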
Business Users do NOT care what you do with your PLANNED down time
as much as they care NOT to have ANY UN-PLANNED down times during "biz time"
Business users can plan around PLANNED un-availability of mission critical systems
What Business Users can NOT usually accept is having to have both Planned and UN-Planned
YOU CAN NOT REDUCE BOTH TO ZERO on an individual component basis
Key: "individual component basis"

Keys: TOTAL costs of NOT having
Never begin asking for the budget...
ask for preference/aversion: acceptable time of UNplanned downtime against money to prevent them
Have the user KEEP updated a contingency "Plan B" for alternative/manual processing, so they realise how mission critical their system really is...
TEST their plan B (fire drill :-)
Ask again for the "TC of not Having"
Ask again for "Not Having Aversion"
RunFaster=1
RunSafer=1
DoNotCrash=1
DoNotGetHacked=1
DoNotScrewMySLA=1
DoNotRuinMyBonus=1
DoNotGetMeSacked=1
Which of these ACTUALLY EXIST?
High Availability
My petty own TWO definitions
Historical = (ex-post) the FACT that a service has been available in the past
Predicted = (ex-ante) a "PERCEPTION" in terms of Probability that a service will be up when it is needed in the future
KEY: do NOT extrapolate past availability
Strategic Planning:
My petty own definition (borrowed from many:-)

Analyze possible future scenarios/events, their value and impact to you
What can go wrong, and how much will it cost me/my entity NOT to have the service
Estimate the "a priori" / "pari passu" probability of these events
Analyze, decide and take actions TODAY that will improve the probability of the desired events and scenarios actually happening
There is no such thing as "THE BEST" practice as absolute recipe

Does it make sense to ask ? Will the server be up tomorrow? NO SLA will make it happen...
at most you will get damages/penalties
Keyword of this slide is TODAY
It makes sense to Actively Plan & Design: WHAT CAN I DO TODAY to IMPROVE the probability or likelihood that a Service will be perceived as available when needed?
The (pre) Works

The Works: Networking

You must apply generally agreed Best Practices for making the individual items more reliable
Examples:
Clean your network of unwanted traffic
Deploy Storage & IO sensibly, i.e. http://www.Lotus.com/Performance
Apply standard tuning to OS and TCP
DELETE every single other protocol you can
PRINT and understand relevant KB notes
Examples of TcpIp advised hacks: EnablePMTUDiscovery=0 TcpTimedWaitDelay=30 etc
Automate the deployment customizations
Analyze your network and Investigate and Eliminate ALL non essential traffic
Domino and I/O Optimization

Don't do this!
[Figure: OS kernel, page file, Notes executables, log files and Domino data all on a single RAID5 volume behind one I/O controller. Bottlenecks: controller channels.]

Domino and I/O Optimization (better)

[Figure: OS kernel and page file on a separate drive; Notes executables, log files and Domino data on a RAID5 volume behind one I/O controller. Bottlenecks: controller channels.]

Domino and I/O Optimization (even better)

[Figure: OS kernel and page file on a separate drive; Notes executables, log files and Domino data on a RAID(1, 5) volume; separate I/O controllers for OS/page and Domino. Bottlenecks: controller channels.]

Domino and I/O Optimization (much better)

[Figure: OS kernel, page files, Notes executables and log files on a separate drive; page and \data on RAID(1, 5) volumes, each behind its own I/O controller. Bottlenecks: apps, Domino, I/O technology, OS technology.]

Hardening HARDWARE Install and Post-upgrade script

Anything/everything installed in the box CAN fail
If something is not installed it can not fail
Physically remove from the boxes ANY hw not used
Modems, Audio, etc
DISABLE everything you can't take out
Classic: lpt1, com1, com2, etc
BOOT SEQUENCE: C, CD, A
DOCUMENT AND REQUIRE PAPER SIGN OFF BY OPS
Hardening SOFTWARE Install and Post-upgrade scripts

MOST SW vulnerabilities are based on SW Bugs
ALL software has (some) known + unknown BUGS
If a software is not installed it can not run :-)
If a software is not running its Bugs don't matter
UNINSTALL everything you do not absolutely need
Remove all un-needed online-documentation
Win32: SPECIFICALLY UNINSTALL WORKSTATION LAN SERVICES!!!
Hardening SOFTWARE Install and Post-upgrade scripts (cont')

UNPLUG ALL NETWORK CABLES BEFORE UPGRADE, install from "safe" CDs, NEVER via LAN/WAN/etc
After new Install, WindowsUpdate or equivalent:
disable everything you do not need
better yet, UNINSTALL what you do not need
check what services are running / started / auto
netstat -an | find "Listen" (check EACH)
Beware of R.S.E. (Reverse Social Engineering)
Remote Management: do NOT mix/share

intranet security/passwords/domains/etc
DOCUMENT AND REQUIRE PAPER SIGN OFF BY OPs
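The "netstat -an ... (check EACH)" step can be scripted so that the list of listening ports ends up in the signed-off documentation. A minimal sketch that parses saved netstat output; the sample text is illustrative only (not taken from a real server), and the parsing assumes the usual column layout with the local address as the first field containing a colon:

```python
def listening_ports(netstat_output):
    """Return the sorted local ports that are in a LISTEN/LISTENING state."""
    ports = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # The state is the last column: "LISTENING" on Win32, "LISTEN" on Unix
        if not fields or not fields[-1].upper().startswith("LISTEN"):
            continue
        # First address-like field after the protocol is the local address
        local = next((f for f in fields[1:] if ":" in f), None)
        if local is None:
            continue
        port = local.rsplit(":", 1)[1]
        if port.isdigit():
            ports.add(int(port))
    return sorted(ports)


# Illustrative sample in the Win32 "netstat -an" layout
SAMPLE = """\
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
  TCP    0.0.0.0:1352           0.0.0.0:0              LISTENING
  TCP    10.0.0.5:1352          10.0.0.9:49211         ESTABLISHED
"""

print(listening_ports(SAMPLE))  # [135, 1352]
```

Every port in the result that you cannot explain is a candidate for the "disable" or "UNINSTALL" steps above.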
It's a wild world out there...

There is a lot of Win32 out there...
online / aDSL!
unpatched / running "Admin"
Most Win32 patches require "reboots"
Linux is as secure as its admins are senior, and vice versa (also true at the lower end)
Vulnerabilities (KNOWN and not)
13% of DNS servers have known vulnerabilities, according to ICANN
PACE of change in OS patching levels
External and "Internal" ScriptKiddies
Hard trends / environmental changes
Dilbertian Examples or WYPIWYG

IF you Pay people to keep the UPTIME
of individual machines (stress on individual)
They WILL schedule + preventative maint time

They will NOT apply patches a.s.a.p./available
They will NOT down a service EVEN when at risk
99% of hacked/virused machines fell to "already well known vulnerabilities"
It will cost you much MORE money and troubles

and you will get LESS value for your money
SLAs are as useful to prevent damage as insurance/assurance [ :-) ]

High Availability
Make you feel better about evil things OUTCOMES

but they do NOTHING TO prevent evil things from happening in the first place
Something that is "likely" to be available...
Must be architected and run as such
"Architected" implies with "HEURISTICS", most of which are "difficult to quantify"
It's easier to measure Sq Feet of Grass to Mow than to quantify "Garden Landscaping Work"
Some "Dilbertian" examples:

I will insure my house in order for it NOT to go on fire, when you'd better
buy insurance in case of disaster BUT ALSO
get a smoke detector (detection)
get fire extinguishers (response !)
I will ask people to sign NDAs...
"Run" requires having meaningful WYPIWYG
The HUMAN Factor: WYPIWYG

SPOFs = Single Point of Failure

Definition:
A single point of failure is anything that is not redundant enough and whose failure will cause damage to the availability of a service
WYPIWYG is actually W.Y.P.I.W.Y.G. "What You Print Pay Is What You Get"
I will NOT repeat here the trivial ones Some "hidden SPOFs":
check bill of materials for anything that has 1
mouse/keyboard/Switch ==> IMPLY SAME RACK
UPS/ISP/Site: you may have to consider multi site/homed
If you measure the wrong things... you WILL get wrong behaviours and outputs
The HUMAN Factor: WTPIWTD

The beauty of Notes/Domino: Secure Replication

Deploy to more than one site enabled by
Replication of databases scheduled replication event driven replication both
WTPIWTD is actually W.T.P.I.W.T.D. "What THEY Pay Is What THEY Demand"

Make sure the BizSponsor pays by BILLBACK a class of service with expected resilience a % of your fulfillment platform
Never let a user "own" a box that you run
easier to say than to do, but try :-)
Tips:
do NOT deploy by OS copy nor FTP, use replica
Hardcode Cluster OU in ACLs, i.e. */Srv/<whatever>
[Names]: Add to prevent pull replication issues
Credits:
From Lotus Operating Principles:

"Establish Purpose Before Action" as in Alice (In wonderland) Tell me Mr. Cat, which "Route" should I use? Cat: Where do you want to go ? Alice: Dunno, haven't figured that out yet! Cat: it does not matter which one you choose!
Our Teachers
Lotus/IBM/Iris: too many links, thanx to all!
Our Partners: Penumbra Partnering Inc. http://www.PENUMBRA.org
Our Customers: Some names in our site :-)
Who moved my cheese ?

MTBF = MEAN TIME BETWEEN (guaranteed !!!) FAILURES

ALL the "answers"

are already out there somewhere
most, in the internet
Average of when you can expect something to fail
Assumes everything will eventually fail - by design!
MTBF implies P(F,eventually)=1.0
the VALUE question is

how to figure out WHAT ARE THE
RELEVANT QUESTIONS ?
Murphy's LAW ...and... Never Let a Machine Know You Need It :-)
It's useful to define "relevant"

the "YOU ARE HERE" has changed
from "my Domino World" to "my Enterprise choices"
Please engrave in my tomb-stone:

The devil is in the variances to averages
PLAN AROUND UnPlanned Failures

Leverage on differences
reduce risk by using stuff that will fail eventually BUT with negative or zero correlation
Win32 code-streams have a huge in-built correlation, so do UNIX's/Linux's
Lower correlation between Win32, Linux, etc
Lower correlation between AS400/iSeries and the rest
you KNOW with a P(X fail, eventually)=1 that individual components = something = will fail (eventually)
but you do not know WHEN, WHAT, HOW
TRY to make cross-correlations work for you
Don't forget Murphy's Law
Use this to weight how you "spread" stuff
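The value of low correlation can be put into numbers. The sketch below assumes fully independent (zero-correlation) failures, which is the best case the slides argue for and not something real platforms guarantee; the service is taken to be up as long as at least one clustermate is up.

```python
def combined_availability(availabilities):
    """Availability of a service that survives while ANY one node is up.

    Assumes node failures are statistically independent.
    """
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


# Two "2 nines" nodes with uncorrelated failures
pair = combined_availability([0.99, 0.99])
print(f"two independent 99% nodes: {pair:.4%}")

# With perfectly correlated failures the second node adds nothing:
# both are down at the same moments, so availability stays at 99%.
```

The closer the components' failures are to correlated (same OS code-stream, same rack, same UPS), the further the real figure drops from this estimate back towards the availability of a single node.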
Embedded Dis-Services
Manage measurables, the right ones

Anything having EITHER an MTBF, an SLA or windowsupdate.com or liveupdates has "Embedded individual outages"
SLA implies Dis-Service agreement trade-offs
The Business User does NOT care for INDIVIDUAL SLAs/MTBFs
So you could, can and must Architect and Design a CLUSTERed Solution and offer a CLUSTER SLA
If you measure & pay people for the cluster SLA

and "free" them from component's SLAs:
For Individual Machines/OS/HW/Components:

they will get downed to investigate/fix/update sooner, a.s.a.p.
known vulnerabilities/problems + preventive maint made during prime time
less dependencies on graveyard-shift work
The user will get

better and overall cheaper service
less dis-service, and smoother/safer Operations
Operators will match demand of services with + offer
Portfolio Principles
Testing Tips & Tricks:

my first SW manager taught me in 1980:
Design with Testing in mind; what you can not PROVE works will either NOT work from day one (but remain hidden until needed) or fail in the future...
Document the testing... for regression testing
"there is nothing wrong with putting all your eggs in one basket, just watch that basket" Henry Ford
don't put all your eggs in one basket cause you can't watch it close enough don't put all your eggs in too many baskets cause you can't watch them all close enough
A Fellow Penumbra told me: You do not need a boat, you need a friend who has one and knows how to use it....
Same for a protocol analyser: you just can NOT guess the client/server dialogue (ex caching)
WYPIWYG is actually W.Y.P.I.W.Y.G

High Availability
The art of doing something "automagically" to improve the perceived performance of the cluster, usually by making intelligent usage of idle resources. Proactive:
Load Spreading
"What You Print Pay Is What You Get"

If you measure the wrong things... you WILL get wrong behaviours and output
Reactive
Performing Load "re-"Balancing by trying to fail over to less busy clustermates

If you are using Reverse Proxies:

What is "Clustering for Geeks"

The 50/50 rule/s:

What we're covering today

50% of what you KNOW about clusters...

is quite useful !!!

Once upon a time... last millenium...

The STATE of the ART in 1995...

Server Configured in 1995...

This is the MOST controversial!

If I were you I would use...

YOU DO NOT NEED THE EXTRA LOGIC

K.I.S.S. (at the Notes/Domino Layer!!!)

Stay awake, more controversy to come...

Listen...(Bonus HACK): ( 42 443 )

This how I connect to my server

PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443

HACK! How does that work?

Cluster Aware "1352" Notes Clients: a.k.a. Cluster-READY clients

If I told you Notes 4.01 was the first one...

Cluster.NCF (client side)

COMPLEX SET of design methodologies, techniques and heuristics

Cluster Examples: 3, 5 or 20+

The "i" in RAID stands for: In-Expensive

Clustering & Failover in Action

Server QUIT while reading...

Server Tasks involved

cldbdir takes care that cluster directory is up to date

replica should still be active as a fallback and to init replicas!

Server regularly check state of their Cluster Mates

Portfolio techiques / Sizing heuristics

API Level call NSPingServer

You can check this information via

There are always 2 practical limits:

Class of service: by "n" instances of resource

Almost Real Time Replication...

Say, for the purpose of example, you have "3"

a) we need to define how we will syncronize

With 3 elements you can define the following

Homework: Try the combinations for 4 units,

b) we still need to spread the load/access.

Nota benissimo: DO STOP AT 4 ! ! !

It's a Notes Database, similar to catalogue,

From LKB: How Push-Pull (std) Replica works

You can afterwards do:

Useful to OUT-of-service databases

From LKB: How Push Cluster Replica works !

How does Cluster Replication works (details)

Cluster Replicator Performance & Statistics

Check if clustering works properly via Show Stat Replica.Cluster.*

Show Stat Server.Cluster.*

How to restrict access (LKB 7002910)

SAI examples, un/touched

You may want to smooth this (or not)

Best Practices for Cluster Replication

Cluster Replication & Database Quotas

Notes Named Network & Directory Assistance