
Clustering: For Geeks... & for Normal People Too!

Copyright 2000-2006 by George Chiesa and dotNSF, Inc - ALL RIGHTS RESERVED It is kindly requested that this presentation is NOT publicly posted, see "license" slide.

George Chiesa <chiesa@dotNSF.com> Daniel Nashed <nsh@nashcom.de>



This Presentation was not researched nor conceived at the British Library

This was not conceived at BL.uk



License: You have a limited license to this presentation.


Copyright 2000-2006 dotNSF and its suppliers. This presentation is non-exclusively LICENSED to you for internal usage within your own entity, company or organization. For fair-usage purposes, please quote the source as "Bubble-Bath Ideas presentation at DNUG 2006, by G. Chiesa and D. Nashed". We request this presentation NOT to be publicly reposted, please! Public abstracts will be posted at http://dotNSF.com & http://nashcom.de

This is bubble-bath-ware!

Disclaimers: NO Proofs...

Ok, just one hack from a red book where I wrote something in...

This presentation is based upon empirical info

Observed behaviours, features, bugs, beyond...
I can NOT prove many of the hypotheses here
Please accept these pearls of wisdom "as is"
Some of this information may be obsolete soon, but it's useful to know what the state of the art is

Download and get this redbook: SG24-7017 Lotus Security Handbook (2004)
Hint: firefox's "modify header" plugin extension (free)

We ALWAYS report security issues to IBM in private.


and no, we will not discuss security bugs (all fixed:-)

If you are using Reverse Proxies:



What is "Clustering for Geeks"


Clustering 101 (definitions/vocabulary)
"Clustering for Geeks" is the art of using documented functionality and "stable observed behaviours" to "automagically" provide a better and cheaper servICE (not serVER). In some cases, this means thinking quite outside the box and pushing the product to its limits!

The 50/50 rule/s:



What we're covering today


60' version of a much longer workshop...
What is called "1352 Native Clustering"
Which pieces are client/server based
How each major piece works "per se"
How to make the puzzle work for you

50% of what you KNOW about clusters...

is quite useless !
50% of what you don't know about clusters

is quite useful !!!


Value Proposition 50%+50%=100%
50% of DDTs (Don't Do That!)
And 50% of DO this!

About questions...

Once upon a time... last millennium...



IT IS "OK"(not impolite)
To interrupt... to ASK questions... 'ala' easyjet... "within reason" :-)

100% of what you do not understand can, and WILL probably hurt you!

The STATE of the ART in 1995...


was THIN ethernet (ethernet 10 as in 10Mb)
if you were an IBM SHOP, you had TR/4/16
Each adaptor had one and only one address
And in 1995 LOTUS was already shipping Clustering and Failover embedded in Notes 4.01 (at the time called NPN=Notes Public Networks)

We reserve the right to postpone the answers, but, when in doubt, raise hand!

So a LOT within Notes has a strong LEGACY. So, we're going to provoke your brain to think!

Server Configured in 1995...



This is the MOST controversial!

If I were you I would use...


JUST ONE TCPIP NOTES PORT
You can still have as many addresses
You can still listen to 0.0.0.0 in notes.ini
You can still have complex tcpip routing tables

YOU DO NOT NEED THE EXTRA LOGIC of Notes trying to cope with Ethernet 10 and just one IP address per physical card.

K.I.S.S. (at the Notes/Domino Layer!!!)

Stay awake, more controversy to come...

Listen...(Bonus HACK): ( 42 443 )



This is how I connect to my server



This time the answer is not 42 ;-) but instead: 443!
You can specify what you are "listening to"
You must understand netstat -an | find "LISTEN"
If you bind addresses you will listen on just those, BUT
You CAN specify "0.0.0.0" as a specific address! You can use this to listen on all addresses at a port
Example: You can set a Notes server to also listen for NRPC on port 443 on 0.0.0.0
  this is a useful hack when you are behind a proxy that only allows access to ports 80 and 443 and you want to access your home server
  on port 443, proxies use the transparent "connect method"

When visiting customers that use HTTP proxies and do not allow direct 1352 access: if the customer agrees to let me connect to my own server while at their premises... using their proxy


HACK! How does that work?



In my server's Notes.ini
PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
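To double-check which address and port each Notes port will listen on, the settings above can be read back with a few lines of Python. This is only a sketch for illustration: the helper name `parse_listeners` is made up here, and it assumes the `<PORT>_TCPIPADDRESS` value always has the shape `flag,address:port` seen on this slide (falling back to the standard NRPC port 1352 when no port is given).

```python
NOTES_INI = """\
PORTS=TCPIP,TCPIP2
TCPIP=TCP,0,15,0,,45088,
TCPIP_TCPIPADDRESS=0,0.0.0.0:1352
TCPIP2=TCP,0,15,0,,45088,
TCPIP2_TCPIPADDRESS=0,0.0.0.0:443
"""


def parse_listeners(ini_text):
    """Return {port_name: (address, tcp_port)} for each port listed in PORTS=."""
    settings = {}
    for line in ini_text.splitlines():
        line = line.strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip().upper()] = value.strip()

    listeners = {}
    for name in settings.get("PORTS", "").split(","):
        name = name.strip()
        if not name:
            continue
        binding = settings.get(name.upper() + "_TCPIPADDRESS")
        if binding is None:
            continue  # no explicit binding configured for this port
        # Assumed value shape: "0,0.0.0.0:443" (a leading flag, then address:port).
        _, _, endpoint = binding.partition(",")
        if ":" in endpoint:
            address, _, port = endpoint.rpartition(":")
            listeners[name] = (address, int(port))
        else:
            listeners[name] = (endpoint, 1352)  # default NRPC port
    return listeners


LISTENERS = parse_listeners(NOTES_INI)
for port_name, (address, tcp_port) in LISTENERS.items():
    print(port_name, "listens on", address, "port", tcp_port)
```

The result should match what netstat -an reports as LISTEN on the server once the ports are active.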


Cluster Aware "1352" Notes Clients: a.k.a. Cluster-READY clients

Definition:
A Notes Client is said to be cluster-aware when it will perform custom logic to transparently and automatically fail-over from one server to another, upon server directive or LACK of reply

QUIZ:
what % of Notes Clients are CLUSTER Aware? hint: what was the first version of Cluster Aware Notes client?

Voila': I can connect using HTTP Proxy "transparent connect method" to 443

If I told you Notes 4.01 was the first one...

Cluster.NCF (client side)


Servers also use it to connect to other servers!

Clustering

Time=22/12/2001 14:26:46 (80256B2A:004F5AD8)
Cluster/NotesWeb
CN=Notes2/O=Notesweb
CN=Notes1/O=Notesweb
Time=03/01/2002 16:18:24 (80256B36:0059935B)
TheConifers.com
CN=dotNSF.TheConifers.com/O=TheConifers
CN=Linux.TheConifers.com/O=TheConifers
CN=WebSphere.TheConifers.com/O=TheConifers
CN=Win2k.TheConifers.com/O=TheConifers
CN=www.TheConifers.com/O=TheConifers

COMPLEX SET of design methodologies, techniques and heuristics


applied to "stuff" that you can use to "make" "n" things to be perceived as ONE
bigger/better & "more reliable"

The key words of this slide are "PERCEIVED as" NB: We're going to focus on
MultiPlatform SOFTWARE Clustering

Perspective...

Cluster Examples: 3, 5 or 20+



The "i" in RAID stands for: In-Expensive


In 1987, Patterson, Gibson and Katz at the University of California Berkeley, published "A Case for Redundant Arrays of Inexpensive Disks (RAID)" . This paper described various types of disk arrays, referred to by the acronym RAID. The basic idea of RAID was to combine multiple small, inexpensive disk drives into an array of disk drives which yields performance exceeding that of a Single Large Expensive Drive (SLED). Additionally, this array of drives appears to the computer as a single logical storage unit or drive.

Cluster.ncf: default max 2 mates TIMES 20 clusters (LKB 185700: Cluster_Name_Cache_Size=n in notes.ini)

Clustering & Failover in Action

Server QUIT while reading...



Cluster Mates:
"Mate" is an industry NON-PC (non politically correct!) std term

Server Tasks involved


cladmin Servertask in R5
  takes care of administrative things (D6+ not in servertasks=, launched automatically)

Definition:
A cluster of something is composed of mates that are logical siblings (no master)
Domino-wise, a Cluster Mate can be:
  Available (normal) (SAI > SAT)
  Busy (Server_Availability_Index <= Server_Availability_Threshold)
Tip: You CAN BUSY a server by setting SAT=100

cldbdir takes care that the cluster directory is up to date
  (D6+ not in servertasks=, launched automatically)

clrepl pushes changes to other replicas based on information from the cluster directory
  (D6+ not in servertasks=, launched automatically)
  logs periodically into the replication log (manual: tell clrepl log)

  Unavailable (or unreachable/perceived as such)
  Restricted (Temp=1 or Perm=2)
  Invalid (never contacted)
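The mate states listed on this slide can be sketched as a small decision function. This is an illustration only, not Domino's actual logic: the function name, its parameters and the order in which the states are checked are assumptions made here; only the SAI/SAT comparison and the state names come from the slide.

```python
def mate_state(sai, sat, reachable=True, restricted=0, contacted=True):
    """Classify a cluster mate using the state names from the slide.

    sai: Server_Availability_Index (0..100)
    sat: Server_Availability_Threshold (0..100)
    restricted: 0 = not restricted, 1 = temporary, 2 = permanent
    The precedence of the checks below is an assumption for illustration.
    """
    if not contacted:
        return "INVALID"        # never contacted
    if not reachable:
        return "UNAVAILABLE"    # or perceived as such
    if restricted in (1, 2):
        return "RESTRICTED"
    # Available when SAI > SAT, Busy when SAI <= SAT.
    return "AVAILABLE" if sai > sat else "BUSY"


print(mate_state(sai=100, sat=0))    # AVAILABLE
print(mate_state(sai=42, sat=100))   # BUSY: the SAT=100 tip
```

Because the index can never exceed 100, setting SAT=100 always satisfies SAI <= SAT, which is why the tip forces the server into the BUSY state.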

replica should still be active as a fallback and to init replicas!

Servers regularly check the state of their Cluster Mates



Portfolio techniques / Sizing heuristics



API Level call NSPingServer


gives back a list of cluster mates and the availability

You can check this information via


> show cluster
Cluster Information
Cluster name: nsh-cluster, Server name: nsh-dus-02/Srv/NashCom/DE
Server cluster probe timeout: 1 minute(s)
Server cluster probe count: 185
Server availability threshold: 0
Server availability index: 100 (state: AVAILABLE)
Cluster members (2)...
server: nsh-dus-02/Srv/NashCom/DE, availability index: 100
server: nsh-dus-01/Srv/NashCom/DE, availability: 42

There are always 2 practical limits:


Lower: at LEAST how many you need to reduce risk
Upper: at MOST how many you can manage effectively
Tip: Start with 3 or 4 and fine-tune afterwards, but please do NOT start with 2 or 6

Class of service: by "n" instances of resource



Almost Real Time Replication...



Say, for the purpose of example, you have "3" "whatevers": OSs, Sites, Servers, Routers, ISPs
Say you name the 3 elements A, B and C

a) we need to define how we will synchronize


Bad News:
Scheduled replication not good enough... Some apps must be cluster aware enabled!

With 3 elements you can define the following


Classes of Service:
  Top, simultaneously present in A+B+C
  Middle, present in either: AB, AC or BC
  Single, present just in A or B or C

Good News:
NATIVE Event/Queue Driven = CLREPL = (aka Almost Real Time)
Most apps will automatically work better

Homework: Try the combinations for 4 units,


C(4,4) + C(4,3) + C(4,2) + C(4,1)
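For those who prefer to check the homework by machine, here is a small sketch that enumerates the placements with Python's itertools. The helper name `classes_of_service` is made up for this example; the grouping by size mirrors the Top / Middle / Single classes described for 3 elements.

```python
from itertools import combinations


def classes_of_service(elements):
    """Group every non-empty combination of elements by its size, largest first."""
    return {
        size: ["".join(combo) for combo in combinations(elements, size)]
        for size in range(len(elements), 0, -1)
    }


three = classes_of_service("ABC")
four = classes_of_service("ABCD")

for size, placements in three.items():
    print(size, placements)

# C(4,4) + C(4,3) + C(4,2) + C(4,1) = 1 + 4 + 6 + 4
total_four = sum(len(placements) for placements in four.values())
print(total_four)  # 15
```

Going from 3 to 4 units already doubles the number of placements (7 to 15), which is a good argument for stopping at 4.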

b) we still need to spread the load/access.

Nota benissimo: DO STOP AT 4 ! ! !

ClDbDir

ClDbDir (contents)

It's a Notes Database, similar to the catalog
Cluster Specific (RepId depends on ClusterName)
Maintained by a server task of the same name
It's in the Enterprise Edition of Domino
Contains info about databases deployed in a cluster
Is used by Notes/Domino Cluster Aware modules
  to know where to push what (and what NOT to!!!)
  and for "failovers": a server finds the resource elsewhere!
Like CATALOG, each server updates its OWN dbs
BEWARE: 8192 is the maximum number of useful entries; you do NOT get a warning NOR an error message!


Bonus Hack: Set Config Cluster_Admin_On=1
It also works IN NON Clustered servers!

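From LKB: How Push-Pull (std) Replica works
(Diagram: an UPDATE arrives at either SERVER; the standard REPLICA task moves it between the two DATABASEs, each holding DATA and VIEW, by Push and Pull.)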

You can afterwards do:


CL DEL filename (cluster delete)
CL COPY source dest REPLICA
CL OUT database (out of service)
CL IN database (in service again)
  both work, but are only meaningful in clusters

Useful to OUT-of-service databases


BEFORE adding an OLD server to a cluster
  useful for decommissioning an old server
  you HAVE to add a server to get it into the CLIENT's Cluster.NCF


From LKB: How Push Cluster Replica works!
(Diagram: an UPDATE arrives at a SERVER; CLREPL pushes it to the replica DATABASE, with DATA and VIEW, on the other SERVER.)

How does Cluster Replication work (details)



Document changes are captured and trigger the Cluster Replicator via a message queue
The Cluster Replicator reads the message queue and pushes changes to all other replicas in the cluster, regardless of replication settings (aka almost "real time" replication)
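The event-driven flow described here can be pictured with a toy model. The sketch below is not how CLREPL is implemented; the class and method names are invented for illustration, and the replica map simply stands in for what the cluster directory would provide.

```python
from collections import deque


class ClusterReplicatorSketch:
    """Toy model: document changes are queued, then pushed to every other replica."""

    def __init__(self, local_server, replicas_by_db):
        self.local_server = local_server
        self.replicas_by_db = replicas_by_db  # db -> servers holding a replica
        self.queue = deque()                  # in memory only, lost on restart
        self.pushed = []                      # record of (target, db, note_id)

    def on_document_change(self, db, note_id):
        # A change event is captured and placed on the message queue.
        self.queue.append((db, note_id))

    def drain(self):
        # The replicator reads the queue and pushes each change to all
        # other replicas, without consulting any replication schedule.
        while self.queue:
            db, note_id = self.queue.popleft()
            for server in self.replicas_by_db.get(db, []):
                if server != self.local_server:
                    self.pushed.append((server, db, note_id))
        return self.pushed


repl = ClusterReplicatorSketch("A", {"mail/user.nsf": ["A", "B", "C"]})
repl.on_document_change("mail/user.nsf", "NT0000A1")
print(repl.drain())
```

Since the queue lives only in memory, anything still queued when the server stops is lost, which is why scheduled replication remains necessary as a safety net.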


CLREPL

ClRepl (cont'd)

CLREPL is a server task
It's an in-Memory QUEUE driven event replicator (REMEMBER BATH TUB !) that SHOULD push content within 15 seconds at most, 7 on average

ClRepl PUSHES content modified locally to all cluster mates containing replicas of the modified database
Tips:
  It PUSHES ignoring source ACL
  Check that the queue is not overfilled
  Always schedule CLASS+1 of them
NB: CLREPL does NOT initialize "Replica Stubs"
It also knows what NOT to push:
  databases that are Out Of Service (for quite obvious reasons)
  databases Pending Delete (cldbdir does the final push, not clrepl!)

thus ClRepl is also sometime called RTR or "ALMOST" REAL TIME REPLICATOR
the KEY here is in "ALMOST"

ClRepl (cont'd)

Cluster Replicator Performance & Statistics


General Rule: number of clrepl = cluster members "minus" 1
  R5: servertasks=events4, repl, router, clrepl, clrepl, clrepl, ...
  D6: Cluster_Replicators=n
My Tip: set it to CLASS_OF_SERVICE PLUS one, not minus one. Overscheduling is cheap; underschedule it and you will have problems!

ClRepl will keep an IN-memory queue
  It's a QUEUE, and can be overfilled
  It's in MEMORY and is NOT disk persistent
THUS, also schedule normal replicas
Tips:
  within reason, overscheduling pull replicas is not a huge issue, because the deltas are small
  i.e. Enabled Replica From */Srv/Whatever to <each>/Srv/Whatever, PULL, every 60 Mins
  This will make servers catch up fast, pulling at restart time.

Check if clustering works properly via Show Stat Replica.Cluster.*


Replica.Cluster.WorkQueueDepth should be "small", i.e. less than 10
Replica.Cluster.RetryWaiting should also be "small", i.e. less than 5
Replica.Cluster.Failed should be zero if possible (easy to say :-)
Check the Max and Average Times in queue; they should be < 10 seconds
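These rules of thumb are easy to turn into an automated check. The sketch below assumes the statistics have already been collected into a dictionary (how they are collected is left open); the function name and the warning texts are invented here, while the three statistic names and limits are the ones quoted above.

```python
def cluster_replication_warnings(stats):
    """Return warnings for the three Replica.Cluster.* rules of thumb."""
    warnings = []
    # Missing statistics are treated as 0, which is an assumption of this sketch.
    if stats.get("Replica.Cluster.WorkQueueDepth", 0) >= 10:
        warnings.append("WorkQueueDepth is not small (10 or more)")
    if stats.get("Replica.Cluster.RetryWaiting", 0) >= 5:
        warnings.append("RetryWaiting is not small (5 or more)")
    if stats.get("Replica.Cluster.Failed", 0) > 0:
        warnings.append("Failed is not zero")
    return warnings


sample = {
    "Replica.Cluster.WorkQueueDepth": 12,
    "Replica.Cluster.RetryWaiting": 0,
    "Replica.Cluster.Failed": 3,
}
for warning in cluster_replication_warnings(sample):
    print(warning)
```

A check like this can be run against periodic statistics snapshots, so that a growing queue is noticed before replicas drift apart.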

TIP: SH ST REPLICA.CLUSTER.*Q*
(Daniel to explain detail stats)

Show Stat Server.Cluster.*


Server.Cluster.OpenRedirects.xxx.Unsuccessful = 0 check for unsuccessful redirects!

How to restrict access (LKB 7002910)



SAI examples, un/touched



Domino server clusters have an optional workload balancing feature that lets you distribute the workload of heavily-used databases across multiple servers in a cluster. To distribute workload, you limit or restrict the work that a server can perform using the following settings in the NOTES.INI:

Server_Availability_Threshold
This setting allows you to specify the availability level below which the server attempts to redirect user requests to other servers in the cluster. A server's availability index is recalculated each minute and compared against any threshold you set. If the index falls to or below the server threshold, the server becomes BUSY. The Cluster Manager redirects access requests from a BUSY server to other servers in the cluster. When an attempt to redirect is unsuccessful, the user receives access to the BUSY server. Each time a redirection occurs, Notes generates a workload balancing event in the Notes log (LOG.NSF).

Server_MaxUsers
This setting specifies the maximum number of user sessions allowed on a server. When the server reaches this limit, the server goes into a MAXUSERS state. The Cluster Manager then attempts to redirect new user requests to other servers in the cluster. To see how often requests are being redirected, check the LOG.NSF for failover events. If redirection of the user request is unsuccessful, the user receives a message, and is not allowed access to the server.

You may want to smooth this (or not)

Server_Restricted
This setting enables a server to deny new open database requests and places the server in a RESTRICTED state. Users who have active connections to databases retain their connections. The Cluster Manager attempts to redirect new requests to other servers in the cluster. When an attempt to redirect is unsuccessful, the user receives a message and is not allowed access to the server. For each redirection attempt, Notes generates a failover event in the LOG.NSF. Note: You can use the Server_Restricted setting for any Domino server. This setting is not restricted to clusters.

Best Practices for Cluster Replication



Cluster Replication & Database Quotas


There are issues with Database Quotas before R5.0.10
Good news:
  New option in R5.0.10: CLREPL_OVERRIDE_QUOTAS=1
  Domino 6 overrides quotas by default
  you get the old behavior with Clrepl_Obeys_Quotas=1 (DDT)

Ensure you have full manager access for LocalDomainServers as a Server group or, better, */Srv/Org as Manager of type Server in all ACLs. I prefer hardcoding OUs to groups. Works always!
Make sure all applications provide roles to give access to documents with reader fields (remember computed author fields)
Give Servers all rights and roles to "see" all documents
Don't use replication formulas for clustered databases
Have a scheduled replication in case some events in the clrepl queue get lost or the server is down...
Add startup replication documents "from *" to ensure databases are up to date after server restart
Schedule replication to the name of the cluster instead of single server names (load balancing & failover)

Bad news:
If you already have this problem, you need to delete the replication history and the CutOff Date to resolve existing replication problems
LotusScript can clear the replication history:
  Set rep = db.ReplicationInfo
  Call rep.ClearHistory()
  Call rep.Save()
but it cannot remove the CutOffDate (in most cases not needed)

Notes Named Network & Directory Assistance



Changes/Recommendations

Customer was using Notes Named Networks (NNN) across WAN connection
Caused unintended traffic

Only local servers in the same NNN
Use only local directories in Directory Assistance (DA)
  Used "*" to specify the local replica only (TN #1087708)
Evaluating Extended Directory Catalog for further optimization
  A directory catalog could simplify working with external addresses and allow more flexibility

Directory Assistance (DA)


Multiple replicas of 4 Directories were used
The first server in the list was a remote server in the same NNN in some cases!
  Changed the configuration to use the local server only
  All servers had replicas of all directories
One external directory had a huge number of deletion stubs because an external company always reimported the directory :-(

Avoid large number of changes in Domino directories


Less need to update views in the Domino Directory
Fewer deletion stubs
Not the first time we have seen nightly complete delete/add import agents in customer environments

How to use NNNs (KISS)



Other High Availability Tips


Domino 6/7 support multiple versions on one logical UNIX/Linux box
  much easier updates and coexistence of multiple releases, and an easy-to-handle "go back" scenario

One for TCPIP (and one per Cluster Port )

Fault-Recovery
Maximize server availability
Faster server restart after a crash!
Automatically collects NSDs for faster troubleshooting

Domino Server Availability Index (SAI or AI)



LoadMon

Domino 6+ uses a new algorithm to calculate the workload of a server and the resulting AI
A number of customers reported an unpredictable, alternating AI which caused Clustering to fail.
The algorithm was enhanced in D6.0.2CF2 and additional notes.ini parameters have been introduced.
But there is another bug that is hopefully finally fixed in D6.5.6 and D7.0.2!
We traced AI at a customer site:
  Live Environment
  Test Environment with Server.Load

Domino 6/7 use a module called "LoadMon"


A routine calculates the speed of current transactions, then summarizes and compares them with previous intervals and minimum values (RunningAvgTime & MinAvgTransTime); unit: microseconds
Transactions measured:
  OPEN_DB, OPEN_NOTE, CLOSE_DB, DB_INFO_GET, DB_REPLINFO_GET, GET_OBJECT_SIZE, READ_OBJECT, GET_SPECIAL_NOTE_ID, DB_READ_HIST, DB_WRITE_HIST, SERVER_AVAILABLE_LITE, NIF_OPEN_NOTE

Expansion Factor (XF)



LoadMon Notes.ini Settings


SERVER_TRANSINFO_MAX (default 5 / max 60)
  number of statistics collections stored in LoadMon
SERVER_TRANSINFO_UPDATE_INTERVAL (default 15)
  interval for statistics capturing & calculation
SERVER_MIN_TRANS (default 5)
  minimum transactions needed for a statistic value to be valid
SERVER_TRANSINFO_NORMALIZE (default 3000)
SERVER_TRANSINFO_HTTP_NORMALIZE (12000)
  as far as we found out, used to initialize empty statistics (zero in loadmon.ncf) on startup in Domino 6

XF is calculated based on the performance values of current transactions in relation to the minimum time for a transaction
It's the number of times the current transactions take longer than the minimum transaction time
XF values for different transactions build an overall XF
This XF is converted into the AI based on a Range used to scale the XF (TN #1112352)
Notes.ini Server_Transinfo_Range=n: n is 6 by default and specifies the maximum Expansion Factor of a Domino Server. The maximum XF is 2 raised to the power n (64 by default)
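To get a feeling for how the Range scales the XF into an AI, here is a rough sketch. The slides only state that the maximum XF is 2 to the power n; the logarithmic mapping used below (AI drops linearly with log2 of XF, reaching 0 at the maximum) is an ASSUMPTION made for illustration and should be checked against TN #1112352. Both function names are invented.

```python
import math


def max_expansion_factor(transinfo_range=6):
    """Maximum XF for a given Server_Transinfo_Range: 2 raised to the power n."""
    return 2 ** transinfo_range


def availability_index(xf, transinfo_range=6):
    """Map an overall XF to an AI between 100 and 0.

    ASSUMPTION: a logarithmic scale where XF = 1 gives AI = 100 and
    XF = 2**range gives AI = 0. This is not confirmed by the slides.
    """
    xf = min(max(xf, 1.0), max_expansion_factor(transinfo_range))
    return round(100 * (1 - math.log2(xf) / transinfo_range))


print(max_expansion_factor())   # 64
print(availability_index(1))    # 100
print(availability_index(8))    # 50
print(availability_index(64))   # 0
```

Under this assumption, a larger Range makes the AI less sensitive: the same XF of 8 would still give an AI of 80 with a Range of 15.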

Debugging LoadMon

se co DEBUG_LOADMON=1
Server.LoadMon.TransInfo.AI.Type = 0
Server.LoadMon.TransInfo.CurrentTransCount.CLOSE_DB = 3
Server.LoadMon.TransInfo.CurrentTransCount.DB_INFO_GET = 2
Server.LoadMon.TransInfo.CurrentTransCount.DB_READ_HIST = 0
Server.LoadMon.TransInfo.CurrentTransCount.DB_REPLINFO_GET = 5
Server.LoadMon.TransInfo.CurrentTransCount.DB_WRITE_HIST = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_NOTE_INFO = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_OBJECT_SIZE = 0
Server.LoadMon.TransInfo.CurrentTransCount.GET_SPECIAL_NOTE_ID = 0
Server.LoadMon.TransInfo.CurrentTransCount.NIF_OPEN_NOTE = 0
Server.LoadMon.TransInfo.CurrentTransCount.OPEN_DB = 3
Server.LoadMon.TransInfo.CurrentTransCount.OPEN_NOTE = 7
Server.LoadMon.TransInfo.CurrentTransCount.READ_OBJECT = 0
Server.LoadMon.TransInfo.CurrentTransCount.SERVER_AVAILABLE_LITE = 2
Server.LoadMon.TransInfo.HttpNormalize = 12000
Server.LoadMon.TransInfo.IntervalInSeconds = 15
Server.LoadMon.TransInfo.Max = 5
Server.LoadMon.TransInfo.MinAvgTransTime.CLOSE_DB = 58.1818181818182
Server.LoadMon.TransInfo.MinAvgTransTime.DB_INFO_GET = 119.875
Server.LoadMon.TransInfo.MinAvgTransTime.DB_READ_HIST = 210.666666666667
Server.LoadMon.TransInfo.MinAvgTransTime.DB_REPLINFO_GET = 88.5714285714286
Server.LoadMon.TransInfo.MinAvgTransTime.DB_WRITE_HIST = 240.2
Server.LoadMon.TransInfo.MinAvgTransTime.GET_NOTE_INFO = 110.235087719298
Server.LoadMon.TransInfo.MinAvgTransTime.GET_OBJECT_SIZE = 141.777777777778
Server.LoadMon.TransInfo.MinAvgTransTime.GET_SPECIAL_NOTE_ID = 93.333333333
Server.LoadMon.TransInfo.MinAvgTransTime.NIF_OPEN_NOTE = 1,031.4285714286
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_DB = 429.166666666667
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_NOTE = 272.987714987715
Server.LoadMon.TransInfo.MinAvgTransTime.READ_OBJECT = 134.285714285714
Server.LoadMon.TransInfo.MinAvgTransTime.SERVER_AVAILABLE_LITE = 95.3333333
Server.LoadMon.TransInfo.MinTrans = 5
Server.LoadMon.TransInfo.Normalize = 3000
Server.LoadMon.TransInfo.Range = 15
Server.LoadMon.TransInfo.RunningAvgTime.CLOSE_DB = 214.333333333333
Server.LoadMon.TransInfo.RunningAvgTime.DB_INFO_GET = 172
Server.LoadMon.TransInfo.RunningAvgTime.DB_READ_HIST = 0
Server.LoadMon.TransInfo.RunningAvgTime.DB_REPLINFO_GET = 187
Server.LoadMon.TransInfo.RunningAvgTime.DB_WRITE_HIST = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_NOTE_INFO = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_OBJECT_SIZE = 0
Server.LoadMon.TransInfo.RunningAvgTime.GET_SPECIAL_NOTE_ID = 0
Server.LoadMon.TransInfo.RunningAvgTime.NIF_OPEN_NOTE = 0
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_DB = 4,143
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_NOTE = 738
Server.LoadMon.TransInfo.RunningAvgTime.READ_OBJECT = 0
Server.LoadMon.TransInfo.RunningAvgTime.SERVER_AVAILABLE_LITE = 104
46 statistics found
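A statistics dump like this one can be post-processed to see which transactions are currently slow. The sketch below parses "name = value" lines and divides each RunningAvgTime by the matching MinAvgTransTime, following the description of XF as "the number of times the current transactions take longer than the minimum". Whether LoadMon computes its per-transaction XF exactly this way is an assumption; the function names are invented, and the sample uses three value pairs taken from the dump.

```python
SAMPLE = """\
Server.LoadMon.TransInfo.MinAvgTransTime.DB_READ_HIST = 210.666666666667
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_DB = 429.166666666667
Server.LoadMon.TransInfo.MinAvgTransTime.OPEN_NOTE = 272.987714987715
Server.LoadMon.TransInfo.RunningAvgTime.DB_READ_HIST = 0
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_DB = 4,143
Server.LoadMon.TransInfo.RunningAvgTime.OPEN_NOTE = 738
"""

RUNNING = "Server.LoadMon.TransInfo.RunningAvgTime."
MINIMUM = "Server.LoadMon.TransInfo.MinAvgTransTime."


def parse_stats(text):
    """Parse 'name = value' lines into floats, dropping thousands separators."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # e.g. the trailing "46 statistics found" line
        try:
            stats[name.strip()] = float(value.replace(",", ""))
        except ValueError:
            continue
    return stats


def expansion_factors(stats):
    """Ratio of running average to minimum average, per transaction type."""
    factors = {}
    for name, running in stats.items():
        if name.startswith(RUNNING) and running > 0:
            transaction = name[len(RUNNING):]
            minimum = stats.get(MINIMUM + transaction)
            if minimum:
                factors[transaction] = running / minimum
    return factors


XF = expansion_factors(parse_stats(SAMPLE))
for transaction, factor in sorted(XF.items()):
    print(transaction, round(factor, 2))
```

Transactions with no activity in the current interval (RunningAvgTime of 0) are skipped, so they do not distort the picture.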

debug_loadmon=1
Enables LoadMon Debugging, writes additional information to the server console
  07.10.2003 07:08:09 Loadmon: Domino AI = 100, XF = 1
And adds 46 additional statistics counters (server.loadmon.*)
These can be captured locally or remotely via "show server" or a statistics collection program: nstats servername or C-API NSFGetServerStats (...)

loadmon.ncf
loadmon.ncf in Domino data directory stores last information from loadmon before server is shutdown loaded on server start to initialize statistics counters

BEWARE LARGE OVERFLOW INTO NEGATIVE VALUES Quit, delete loadmon.ncf, restart server (do after upgrades!)

AI in D6.0.1 without Optimizing of Loadmon
[Chart: Domino 6.0.1 on AIX 5.1, dropping AI]

Listen...(HACK 2)
You need to understand which fields are
Listens (usually in specific tabs)
HostNames that are NOT Listens
For example: you can tell Domino that its HTTP hostname is the name of something else, even on a different machine, and URLs will still be created nicely

What did we find out?
AI with the default interval of 15 sec and 5 sampling values does not always result in a steady AI
We needed to find values which provide
steady values, so that cluster failover does not occur "randomly" or cause Ping-Pong effects
a reasonable time to reflect the current workload in the AI
Standard interval and sampling: 15*5 covers 75 seconds
Interval of 10 seconds with 20 sampling values covers 200 seconds
Standard Server.Load scripts do not help much because most transactions are not used in standard scripts
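
The time span that the AI reflects can be estimated as the sampling interval multiplied by the number of sampling values, for example 10 seconds x 20 samples = 200 seconds. A minimal Python sketch of that arithmetic follows; the function name is made up for illustration and is not a Domino API.

```python
def ai_window_seconds(interval_seconds: int, samples: int) -> int:
    """Time span covered by the samples that feed the availability index."""
    return interval_seconds * samples


if __name__ == "__main__":
    # (interval, samples): the default pair and the tuned pair from the slide
    for interval, samples in [(15, 5), (10, 20)]:
        window = ai_window_seconds(interval, samples)
        print(f"{interval}s x {samples} samples covers {window}s")
```

A longer window smooths out short spikes, at the price of reacting later to a real change in workload.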


HACK 3: How to use clustering for server consolidation

Add ALL servers to ONE CLUSTER...
Make sure each DB exists no more than 3 times (replicas) in the cluster
SET the SAT of the OLD servers to 100
This will BUSY them out
Users will LOADBALANCE to the new servers (all NON ADMIN/Manager users)
Unless you forgot an app that lives only on the old servers: it will continue to be accessed there

For example, to get this

Cluster name: DOMPMAC01, Server name: DOMAGP01/SRV/Customer
Server cluster probe timeout: 1 minute(s)
Server cluster probe count: 47191
Server cluster default port: *
Server availability threshold: 100
Server availability index: 0 (state: BUSY)
Server availability default minimum transaction time: 3000
Cluster members (11):
Server: DOMPMA01/SRV/Customer, availability index: 81
Server: DOMPMA02/SRV/Customer, availability index: 78
Server: DOMPIN02/SRV/Customer, availability index: 65
Server: DOMPIN01/SRV/Customer, availability index: 63
Server: DOMMYP01/OLD/SRV/Customer, availability index: 0
Server: DOMMYP02/OLD/SRV/Customer, availability index: 0
server: DOMOEP01/SRV/Customer, availability: BUSY
server: DOMHEP01/SRV/Customer, availability: BUSY
server: DOMCVP01/SRV/Customer, availability: BUSY
server: DOMVGP01/SRV/Customer, availability: BUSY
server: DOMAGP01/SRV/Customer, availability: BUSY
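
When there are many cluster members, it can help to pull the member lines out of this console output with a script. The Python sketch below assumes the output looks exactly like the member lines shown on the slide; the helper names and the regular expression are illustrative, and the ranking is only a reading aid, not Domino's failover algorithm.

```python
import re

# Matches "Server: NAME, availability index: 81" and "server: NAME, availability: BUSY"
MEMBER_RE = re.compile(
    r"server:\s*(?P<name>[^,]+),\s*availability(?: index)?:\s*(?P<value>\d+|BUSY)",
    re.IGNORECASE,
)

SAMPLE = """\
Server: DOMPMA01/SRV/Customer, availability index: 81
Server: DOMPIN01/SRV/Customer, availability index: 63
Server: DOMMYP01/OLD/SRV/Customer, availability index: 0
server: DOMAGP01/SRV/Customer, availability: BUSY
"""


def parse_members(text):
    """Return {server name: availability index}, with None for BUSY servers."""
    members = {}
    for match in MEMBER_RE.finditer(text):
        value = match.group("value")
        name = match.group("name").strip()
        members[name] = None if value.upper() == "BUSY" else int(value)
    return members


def rank_by_index(members):
    """Names of non-BUSY servers, highest availability index first."""
    numeric = [(name, ai) for name, ai in members.items() if ai is not None]
    return [name for name, ai in sorted(numeric, key=lambda item: item[1], reverse=True)]


if __name__ == "__main__":
    parsed = parse_members(SAMPLE)
    for name in rank_by_index(parsed):
        print(name, parsed[name])
```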

Fine tuning via SAI/SAT/Range

Cluster information:
Cluster name: DOMPMAC01, Server name: DOMMYP01/SRV/Customer
Server cluster probe timeout: 1 minute(s)
Server cluster probe count: 62831
Server cluster default port: *
Server availability threshold: 0
Server availability index: 28 (state: AVAILABLE)
Server availability default minimum transaction time: 3000
Cluster members (11):
Server: DOMPMA02/SRV/Customer, availability index: 79 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPMA01/SRV/Customer, availability index: 78 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPIN01/SRV/Customer, availability index: 64 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMPIN02/SRV/Customer, availability index: 39 )) SERVER_AVAILABILITY_THRESHOLD=5
Server: DOMMYP02/OLD/SRV/Customer, availability index: 0 )) SERVER_TRANSINFO_RANGE=2 & SAT=0
Server: DOMMYP01/OLDSRV/Customer, availability index: 0 )) SERVER_TRANSINFO_RANGE=2 & SAT=0
server: DOMHEP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMVGP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMCVP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMAGP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100
server: DOMOEP01/SRV/Customer, availability: BUSY )) SERVER_AVAILABILITY_THRESHOLD=100

And when you turn off a server...

Remember to ignore the probe failures
If annoyed, increase the period of the probe: Server_Cluster_Probe_Timeout=1 (minute)

Dead servers do not run CLDBDIR, thus (hack!)
In the new servers' CLDBDIR, manually DELETE ALL instances of DBs on the old servers
Failover by replica ID then uses the new servers!
CLREPL will NOT attempt to keep the dead servers updated (EXTREMELY IMPORTANT!)

You can keep old dead servers in the cluster for a reasonably long time, BUT you must check the logs and
sh st replica.cluster.*q*
You can't have lost transactions, because CLDBDIR thinks the old servers are EMPTY but alive
Cluster Manager will say once a minute that they are unreachable, which is what you want for AUTOMATIC user failover... over time...

To finally delete the server
use AdminP !!!

Other Caveats/Tips/Tricks:
Make sure you edit the old servers' records in the NAB to remove mail routing
You do not want mail to be routed via old dead servers
Run the server decommission report BEFORE turning them off... a machine that is turned off produces no reports
DO NOT remove the old server from the cluster yet

Always TEST failovers
with a TEST user ID that is
NON Administrator
NON Manager of the application databases
It is assumed that managers know where they want to access DBs and will NOT attempt failover
If you test with ADMIN.id, it will drive you MAD

Cluster Analysis
Cluster Analysis is a great feature for finding problems in your cluster
It's part of the Admin Client (Server / Analysis / Cluster ...)
Run it to find problems with ACLs, replication, missing databases, ...

Tips
Run it, print it and sign off all warnings you find
Use FT Search to remove multiple occurrences of similar or already fixed problems until the DB is empty
Run the analysis again to confirm you addressed all problems

Failover
Definition:
Server initiated: due to reactive load balancing or failures
Client initiated: the server is dead or perceived as dead; requires the client to know how to connect to cluster mates without server assistance!
Tip: insert the address in the name:
CN=<FullyQualifiedDomainName>/Whatever
CN=194.196.39.11/Srv/LotusEmea/Net

DEBUG_NOSTDOUT=1
If you leave debug parameters ON in prod
capture the debug output in files (debug_Outfile=) and NOT in StdOut
for performance reasons and also... for the sanity of old 3rd party apps (& BACKUPs)

DO NOT USE THIS PLEASE
DEBUG_RUN_AS_ROOT=1
it WILL allow you to run as root on UNIX/Linux
it will NOT allow you to run as non-root later, unless you fix the owners, permissions, etc. of everything it created (just DDT please!)
Exception: some custom restores required root
GET A NEW VERSION OF THE RESTORE TOOL

Replication Debugging
DEBUG_REPL=2 & DEBUG_REPL_ALL=1

sh st replica.cluster.*
(if you do not read the stats, why bother clustering?)

Log_Replication: (not ORable, different values, -1 does not work!)
Log_Replication=0....No replication logging
Log_Replication=1....Logs server replication events
Log_Replication=2....Adds logging of replication activity at the database level
Log_Replication=3....Adds logging of replication activity at the note level
Log_Replication=4....Adds logging of replication activity at the field level
Log_Replication=5....Adds summary logging

RTR_Logging: (Tip: You can OR (sum) these, i.e. 63 is a LOT!)
RTR_Logging= 1....Default level of logging (major routines, events, etc.)
RTR_Logging= 2....Log all context structure changes
RTR_Logging= 4....Log replications: attempted & performed
RTR_Logging= 8....Log iterations through main polling loop
RTR_Logging=16...Verbose debug logging
RTR_Logging=32...Log all lock operations
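
Because the RTR_Logging values are powers of two, a combined setting can be decoded bit by bit. The Python sketch below illustrates that idea with the six levels listed on the slide; the function and table names are made up for illustration.

```python
# Bit values and descriptions as listed on the slide
RTR_LOGGING_FLAGS = {
    1: "Default level of logging (major routines, events, etc.)",
    2: "Log all context structure changes",
    4: "Log replications: attempted & performed",
    8: "Log iterations through main polling loop",
    16: "Verbose debug logging",
    32: "Log all lock operations",
}


def decode_rtr_logging(value):
    """Return the descriptions of every flag that is set in the given value."""
    return [text for bit, text in RTR_LOGGING_FLAGS.items() if value & bit]


if __name__ == "__main__":
    # 63 = 1 + 2 + 4 + 8 + 16 + 32, i.e. every level at once
    for description in decode_rtr_logging(63):
        print(description)
```

The same trick does not apply to Log_Replication, whose values are separate levels rather than bits.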

Replica.Cluster.Docs.Added = 26790
Replica.Cluster.Docs.Deleted = 16060
Replica.Cluster.Docs.Updated = 378378
Replica.Cluster.Failed = 30
Replica.Cluster.Files.Local = 83
Replica.Cluster.Files.Remote = 83
Replica.Cluster.Retry.Skipped = 222
Replica.Cluster.Retry.Waiting = 0
Replica.Cluster.SecondsOnQueue = 13
Replica.Cluster.SecondsOnQueue.Avg = 2
Replica.Cluster.SecondsOnQueue.Max = 3593
Replica.Cluster.Servers = 1
Replica.Cluster.SessionBytes.In = 160450213
Replica.Cluster.SessionBytes.Out = 824894460
Replica.Cluster.Successful = 13484
Replica.Cluster.WorkQueueDepth = 0
Replica.Cluster.WorkQueueDepth.Avg = 0
Replica.Cluster.WorkQueueDepth.Max = 4
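
Reading these counters regularly is easier with a small script. The Python sketch below parses "name = value" lines like the ones above and flags a few queue-related values; the function names and the 60-second threshold are arbitrary assumptions for illustration, not recommended Domino limits.

```python
SAMPLE = """\
Replica.Cluster.Failed = 30
Replica.Cluster.SecondsOnQueue = 13
Replica.Cluster.SecondsOnQueue.Max = 3593
Replica.Cluster.WorkQueueDepth = 0
Replica.Cluster.WorkQueueDepth.Max = 4
"""


def parse_stats(text):
    """Parse 'Name = value' lines into {name: float}; skip anything non-numeric."""
    stats = {}
    for line in text.splitlines():
        name, separator, value = line.partition("=")
        if not separator:
            continue
        try:
            # Some console values use a thousands separator, e.g. "4,143"
            stats[name.strip()] = float(value.strip().replace(",", ""))
        except ValueError:
            continue
    return stats


def replication_warnings(stats, max_queue_seconds=60.0):
    """Return human-readable warnings; the threshold is a made-up example."""
    warnings = []
    if stats.get("Replica.Cluster.Failed", 0) > 0:
        warnings.append("cluster replications have failed")
    if stats.get("Replica.Cluster.SecondsOnQueue.Max", 0) > max_queue_seconds:
        warnings.append("a change waited a long time on the queue")
    if stats.get("Replica.Cluster.WorkQueueDepth", 0) > 0:
        warnings.append("work queue is not empty right now")
    return warnings


if __name__ == "__main__":
    for warning in replication_warnings(parse_stats(SAMPLE)):
        print("WARNING:", warning)
```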

Network_Sprayer_Address=*
Useful to disable name checking after connect
I just wish it worked better (it does not always work)
DO_NOT_USE_REMEMBERED_ADDRESSES=1

Failover by Path
Normally, you should NOT get it
What you should get is mostly failover by RepId
It is a sign that you have multiple instances of the same replica ID on one server
You should (almost) never have duplicates
SH DIR on the server tells you about duplicates
Requested to be added to the ADMIN client next

Server_TransInfo_Normalize
default = 3000
Units are milliseconds * 100 of a standard transaction
3000 is a BAAAAAAAAD default
Fortunately loadmon.ncf helps to save the old real times for all transactions

Server_TransInfo_Range
If you don't know better, set it between 10 and 40
default is 6 and is WAAAAAAAAY TOOO LOW

Allegedly (rumour)
it also helps NON clustered HTTP servers
Apparently some code in HTTP checks the SAI for self tuning, and a better SAI uses the HW better

USE: AvailabilityIndexType=1 (for non-HTTP)

Tell CLREPL pause/resume
Useful to be able to read something if you are using a very high debug level
Remember to resume it, else you will go nuts trying to figure out what happened.

Show AI (in AIx but is different)
What should be seen is this:
> show ai
Range XF Hits Min AI Max AI

nconsole DOMPHU00 "sh ai"
Range   XF     Hits    Min AI   Max AI
1       2      48406   93       100
2       4      1380    77       93
3       8      1226    64       77
4       16     821     51       64
5       32     106     38       51
6       64     39      26       37
7       128    16      20       25
...
Current value of SERVER_TRANSINFO_RANGE is 6.
<<changes suggested for SERVER_TRANSINFO_RANGE>>

nconsole DOMPHU01 "sh ai "
Range   XF     Hits    Min AI   Max AI
1       2      48826   93       100
2       4      1052    77       93
3       8      1148    64       77
4       16     711     51       64
5       32     197     38       51
6       64     40      27       38
7       128    0
8       256    4       1        5
9       512    13      0        0
10      1024   11      0        0
11      2048   1       0        0
12      4096   1       0        0

Clustering: For Geeks... and for Normal People Too!

Q&A

George Chiesa <chiesa@dotNSF.com> Daniel Nashed <nsh@nashcom.de>

[Diagram: two servers, each with a database (data and view); updates are pushed and pulled between the replicas by CLREPL]

SUPPORT "EXTRA" MATERIAL

These are the support pages... Which you can get by asking for them on the back of your business card... We politely request NO REPOSTING...

Mission Critical Service
Much better defined by the Total Cost of NOT HAVING IT when you need it
In other words, something that, despite having a (well known?) TCO, may prove much more painful & expensive "NOT TO HAVE"

The "Nines":

2 nines (99%) =circa= 88 hours/year
3 nines (99.9%) =circa= 9 hours/year
4 nines (99.99%) =circa= 52 minutes/year
5 nines (99.999%) =circa= 5 minutes/year
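
The figures for the "nines" follow from the number of hours in a year. A short Python sketch of that arithmetic is given here, assuming a 365-day year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours per year a service may be down at the given availability."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    for nines in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(nines)
        print(f"{nines}%: {hours:.2f} hours/year = {hours * 60:.1f} minutes/year")
```

This gives roughly 87.6 hours, 8.76 hours, 52.6 minutes and 5.3 minutes, which the slide rounds.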
Downtime costs per user = [ (Total hours of Unscheduled downtime X (25% of user population) X (Hourly user salary)) + (Total hours of Scheduled downtime X Hourly Messaging Administrator Salary) ] / Number of messaging users
NOTA BENE: R.S.E. and Change Management/Control needs
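
The downtime cost formula can be tried out with your own numbers. The Python sketch below is one possible reading of the slide, assuming that unscheduled downtime hits 25% of the user population at the hourly user salary and that scheduled downtime is charged at the administrator's hourly salary; the parameter names and sample figures are made up.

```python
def downtime_cost_per_user(
    unscheduled_hours,
    scheduled_hours,
    users,
    user_hourly_salary,
    admin_hourly_salary,
    affected_share=0.25,  # assumption: share of users hit by unscheduled downtime
):
    """Downtime cost per messaging user, following one reading of the slide."""
    unscheduled_cost = unscheduled_hours * (affected_share * users) * user_hourly_salary
    scheduled_cost = scheduled_hours * admin_hourly_salary
    return (unscheduled_cost + scheduled_cost) / users


if __name__ == "__main__":
    # Made-up figures: 10 h unscheduled, 20 h scheduled, 1000 users
    cost = downtime_cost_per_user(10, 20, 1000, 30.0, 50.0)
    print(f"Downtime cost per user: {cost:.2f}")
```

With these sample figures almost all of the cost comes from the unscheduled hours, which matches the slide's point that planned and unplanned downtime weigh very differently.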

Keys: TOTAL costs of NOT having

Business Users do NOT care what you do with your PLANNED down time
as much as they care NOT to have ANY UN-PLANNED down time during "biz time"
Business users can plan around PLANNED unavailability of mission critical systems
What Business Users can NOT usually accept is having to have both Planned and UN-Planned
YOU CAN NOT REDUCE BOTH TO ZERO on an individual component basis
Key: "individual component basis"

Never begin asking for the budget...
Ask for the preference/aversion: acceptable time of UNplanned downtime against money to prevent it
Have the user KEEP updated a contingency "Plan B" for alternative/manual processing, so they realise how mission critical their system really is...
TEST their plan B (fire drill :-)
Ask again for the "TC of not Having"
Ask again for "Not Having Aversion"

Which of these do ACTUALLY EXIST ?
RunFaster=1
RunSafer=1
DoNotCrash=1
DoNotGetHacked=1
DoNotScrewMySLA=1
DoNotRuinMyBonus=1
DoNotGetMeSacked=1

High Availability
My petty own TWO definitions
Historical = (ex-post) the FACT that a service has been available in the past
Predicted = (ex-ante) a "PERCEPTION", in terms of probability, that a service will be up when it is needed in the future

KEY: do NOT extrapolate past availability

Strategic Planning:
My petty own definition (borrowed from many :-)
Analyze possible future scenarios/events, their value and impact to you
What can go wrong, and how much will it cost me/my entity NOT to have the service
Estimate the "a priori" / "pari passu" probability of these events
Analyze, decide and take actions TODAY that will improve the probability of the desired events and scenarios actually happening
Keyword of this slide is TODAY

There is no such thing as "THE BEST" practice as an absolute recipe
Does it make sense to ask: will the server be up tomorrow?
NO SLA will make it happen... at most you will get damages/penalties
It makes sense to Actively Plan & Design: WHAT CAN I DO TODAY to IMPROVE the probability or likelihood that a Service will be perceived as available when needed?

The (pre) Works
You must apply generally agreed Best Practices for making the individual items more reliable
Examples:
Clean your network of unwanted traffic
Deploy Storage & IO sensibly, i.e. http://www.Lotus.com/Performance
Automate the deployment customizations

The Works: Networking
Apply standard tuning to OS and TCP
DELETE every single other protocol you can
PRINT and understand the relevant KB notes
Examples of advised TCP/IP hacks:
EnablePMTUDiscovery=0
TcpTimedWaitDelay=30
etc
Analyze your network; investigate and eliminate ALL non-essential traffic

Domino and I/O Optimization
Don't do this!
Bottlenecks: controller channels, OS kernel, page file, Notes executables, log files, Domino data
[Diagram: single RAID5 volume, one I/O controller]

Domino and I/O Optimization (better)
Separate drive: OS kernel, page file
Bottlenecks: controller channels, Notes executables, log files, Domino data
[Diagram: RAID5 volume, I/O controller]

Domino and I/O Optimization (even better)
Separate drive: OS kernel, page file, controller channels
Bottlenecks: Notes executables, log files, Domino data
[Diagram: Page, OS, Domino; two I/O controllers; RAID(1, 5) volume]

Domino and I/O Optimization (much better)
Separate drive: OS kernel, page files, controllers, Notes executables, log files
Bottlenecks: Apps, Domino; I/O technology; OS technology
[Diagram: OS, Domino, Page, \data; two I/O controllers; RAID(1, 5) volumes]

Hardening HARDWARE Install and Post-upgrade script
Any/Everything installed in the box CAN fail
If something is not installed it can not fail
Physically remove from the boxes ANY HW not used
Modems, Audio, etc
DISABLE everything you can't take out
Classic: lpt1, com1, com2, etc
BOOT SEQUENCE: C, CD, A
DOCUMENT AND REQUIRE PAPER SIGN OFF BY OPS

Hardening SOFTWARE Install and Post-upgrade scripts
MOST SW vulnerabilities are based on SW bugs
ALL software has (some) known + unknown BUGS
If a software is not installed it can not run :-)
If a software is not running its bugs don't matter
UNINSTALL everything you do not absolutely need
Remove all un-needed online documentation
Win32: SPECIFICALLY UNINSTALL WORKSTATION LAN SERVICES!!!

Hardening SOFTWARE Install and Post-upgrade scripts (cont')
UNPLUG ALL NETWORK CABLES BEFORE UPGRADE; install from "safe" CDs, NEVER via LAN/WAN/etc
After a new install, WindowsUpdate or equivalent:
disable everything you do not need
better yet, UNINSTALL what you do not need
check what services are running / started / auto
netstat -an | find /i "listen" (check EACH)
Beware of R.S.E. (Reverse Social Engineering)
Remote Management: do NOT mix/share intranet security/passwords/domains/etc
DOCUMENT AND REQUIRE PAPER SIGN OFF BY OPs
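
The "check EACH listener" step can be scripted so that the list is easy to attach to the sign-off document. The Python sketch below filters saved netstat -an output for listening sockets; the sample text and function name are made up, and the match is case-insensitive because Windows prints LISTENING while most UNIX systems print LISTEN.

```python
# Made-up sample in the style of Windows "netstat -an" output
SAMPLE = """\
  TCP    0.0.0.0:1352     0.0.0.0:0        LISTENING
  TCP    10.0.0.5:1352    10.0.0.9:4711    ESTABLISHED
  TCP    0.0.0.0:80       0.0.0.0:0        LISTENING
"""


def listening_lines(netstat_output):
    """Return the stripped lines that describe a listening socket."""
    return [
        line.strip()
        for line in netstat_output.splitlines()
        if "LISTEN" in line.upper()
    ]


if __name__ == "__main__":
    for line in listening_lines(SAMPLE):
        print(line)
```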

It's a wild world out there...
Hard trends / environmental changes
There is a lot of Win32 out there... online / aDSL! unpatched / running "Admin"
Most Win32 patches require "reboots"
Linux is as secure as its admins are senior, and vice versa; also true at the lower end
Vulnerabilities (KNOWN and not)
13% of DNS servers have known vulnerabilities, according to ICANN
PACE of change in OS patching levels
External and "Internal" ScriptKiddies

Dilbertian Examples or WYPIWYG
IF you pay people to keep the UPTIME of individual machines (stress on individual)
They WILL schedule + preventative maint time
They will NOT apply patches a.s.a.p./when available
They will NOT down a service EVEN when at risk
99% of hacked/virused machines were hit through "already well known vulnerabilities"
It will cost you much MORE money and trouble, and you will get LESS value for your money

SLAs are as useful to prevent damage as insurance/assurance [ :-) ]
They make you feel better about the OUTCOMES of evil things
but they do NOTHING TO prevent evil things from happening in the first place
Some "Dilbertian" examples:
I will insure my house in order for it NOT to catch fire, when you'd better buy insurance in case of disaster BUT ALSO
get a smoke detector (detection)
get fire extinguishers (response !)
I will ask people to sign NDAs...

High Availability
Something that is "likely" to be available...
Must be architected and run as such
"Architected" implies "HEURISTICS", most of which are "difficult to quantify"
It's easier to measure Sq Feet of Grass to Mow than to quantify "Garden Landscaping Work"
"Run" requires having meaningful WYPIWYG

The HUMAN Factor: WYPIWYG
WYPIWYG is actually W.Y.P.I.W.Y.G. "What You Print Pay Is What You Get"
If you measure the wrong things... you WILL get wrong behaviours and outputs

SPOFs = Single Point of Failure
Definition:
A single point of failure is anything that is not redundant enough and whose failure will cause damage to the availability of a service
I will NOT repeat here the trivial ones
Some "hidden SPOFs":
check the bill of materials for anything that has 1 mouse/keyboard/switch ==> IMPLIES SAME RACK
UPS/ISP/Site: you may have to consider multi site/homed

The HUMAN Factor: WTPIWTD
WTPIWTD is actually W.T.P.I.W.T.D. "What THEY Pay Is What THEY Demand"
Make sure the BizSponsor pays by BILLBACK
a class of service with expected resilience
a % of your fulfillment platform
Never let a user "own" a box that you run
easier to say than to do, but try :-)

The beauty of Notes/Domino: Secure Replication
Deploy to more than one site, enabled by replication of databases
scheduled replication
event driven replication
both
Tips:
do NOT deploy by OS copy nor FTP, use replicas
Hardcode the cluster OU in ACLs, i.e. */Srv/<whatever>
[Names]: Add to prevent pull replication issues

Credits:
Our Teachers
Lotus/IBM/Iris: too many links, thanx to all!
Our Partners: Penumbra Partnering Inc. http://www.PENUMBRA.org
Our Customers
Some names in our site :-)

From Lotus Operating Principles:
"Establish Purpose Before Action", as in Alice (in Wonderland)
Alice: Tell me Mr. Cat, which "Route" should I use?
Cat: Where do you want to go?
Alice: Dunno, haven't figured that out yet!
Cat: Then it does not matter which one you choose!

Who moved my cheese ?
ALL the "answers" are already out there somewhere
most, on the internet
the VALUE question is how to figure out WHAT ARE THE RELEVANT QUESTIONS?
It's useful to define "relevant"
the "YOU ARE HERE" has changed from "my Domino World" to "my Enterprise choices"

MTBF = MEAN TIME BETWEEN (guaranteed !!!) FAILURES
Average of when you can expect something to fail
Assumes everything will eventually fail - by design!
MTBF implies P(F,eventually)=1.0
Murphy's LAW ...and... Never Let a Machine Know You Need It :-)
Please engrave on my tombstone:
The devil is in the variances to averages
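
The statement that an MTBF implies P(F, eventually) = 1 can be made concrete with a simple model. The Python sketch below assumes a constant failure rate (an exponential model), which is an assumption of this example and not something stated on the slides; the function name and sample MTBF are illustrative.

```python
import math


def p_fail_by(hours, mtbf_hours):
    """Probability of at least one failure within `hours`, constant failure rate."""
    return 1.0 - math.exp(-hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 50_000.0  # made-up MTBF in hours
    for hours in (8_760, 50_000, 500_000):
        print(f"within {hours} h: P(fail) = {p_fail_by(hours, mtbf):.4f}")
```

Under this model the probability only approaches 1 as time grows, and at exactly one MTBF it is already about 63%: the average says little about when a particular box will fail.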

PLAN AROUND UnPlanned Failures
you KNOW with a P(X fail,eventually)=1 that individual components = something = will fail (eventually)
but you do not know WHEN, WHAT, HOW
TRY to make cross-correlations work for you
Don't forget Murphy's Law

Leverage on differences
reduce risk by using stuff that will fail eventually BUT with negative or zero correlation
Win32 code-streams have a huge in-built correlation, so do UNIX's/Linux's
Lower correlation between Win32, Linux, etc
Lower correlation between AS400/iSeries and the rest
Use this to weight how you "spread" stuff
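
Why correlation matters can be shown with two components that each fail in a given period with some probability. The Python sketch below uses the standard identity for two yes/no events, P(both) = p1*p2 + rho*sqrt(p1(1-p1)p2(1-p2)); the probabilities and correlation values are made up, and the result is clamped because not every correlation is feasible for every pair of probabilities.

```python
import math


def p_both_fail(p1, p2, rho):
    """Probability that two components fail in the same period, given correlation rho."""
    covariance = rho * math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    joint = p1 * p2 + covariance
    # Clamp to the feasible range [0, min(p1, p2)]
    return min(max(joint, 0.0), min(p1, p2))


if __name__ == "__main__":
    p = 0.01  # made-up chance that one server fails in a given period
    for rho in (1.0, 0.5, 0.0, -0.5):
        print(f"rho = {rho:+.1f}: P(both fail) = {p_both_fail(p, p, rho):.6f}")
```

With fully correlated failures the second server adds nothing (the pair fails as often as a single box), while zero correlation already cuts the joint failure probability by a factor of 100 in this example.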

Embedded Dis-Services
Anything having EITHER an MTBF, an SLA, windowsupdate.com or liveupdates has "embedded individual outages"
SLA implies Dis-Service agreement trade-offs
The Business User does NOT care for INDIVIDUAL SLAs/MTBFs
So you could, can and must architect and design a CLUSTERed solution and offer a CLUSTER SLA

Manage measurables, the right ones
If you measure & pay people for the cluster SLA and "free" them from components' SLAs:
For individual machines/OS/HW/components:
they will get downed to investigate/fix/update sooner, a.s.a.p.
known vulnerabilities/problems + preventive maint handled during prime time
less dependency on graveyard-shift work
The user will get
better and overall cheaper service
less dis-service, and smoother/safer operations
Operators will match demand of services with + offer

Portfolio Principles
"there is nothing wrong with putting all your eggs in one basket, just watch that basket" Henry Ford
don't put all your eggs in one basket, 'cause you can't watch it closely enough
don't put all your eggs in too many baskets, 'cause you can't watch them all closely enough

Testing Tips & Tricks:
my first SW manager taught me in 1980:
Design with testing in mind; what you can not PROVE works will either NOT work from day one but remain hidden until needed, or fail in the future...
Document the testing... for regression testing
A Fellow Penumbra told me: You do not need a boat, you need a friend who has one and knows how to use it....
Same for a protocol analyser: you just can NOT guess the client/server dialogue (e.g. caching)

WYPIWYG is actually W.Y.P.I.W.Y.G
"What You Print Pay Is What You Get"
If you measure the wrong things... you WILL get wrong behaviours and output

High Availability
The art of doing something "automagically" to improve the perceived performance of the cluster, usually by making intelligent usage of idle resources.
Proactive:
Load Spreading
Reactive:
Performing Load "re-"Balancing by trying to fail over to less busy clustermates