
Experience and Lessons learnt from running High Availability Databases on Network Attached Storage

Manuel Guijarro, Ruben Gaspar et al, CERN IT/DES
CERN IT/DES, CH-1211 Geneva 23, Switzerland
Manuel.Guijarro@cern.ch, Ruben.Gaspar.Aparicio@cern.ch
Abstract. The Database and Engineering Services Group of CERN's Information Technology Department supplies the Oracle Central Database services used in many activities at CERN. In order to provide High Availability and ease management for those services, a NAS (Network Attached Storage) based infrastructure has been set up. It runs several instances of the Oracle RAC (Real Application Cluster) using NFS (Network File System) as shared disk space for RAC purposes and Data hosting. It is composed of two private LANs (Local Area Networks), one to provide access to the NAS filers and a second to implement the Oracle RAC private interconnect, both using Network Bonding. NAS filers are configured in partnership to prevent having single points of failure and to provide automatic NAS filer failover.

1. Introduction
This paper describes a NAS based infrastructure and its implementation using a Fabric Management framework such as the Quattor administration toolkit. It also covers aspects related to NAS performance and monitoring, as well as Data Backup and Archive of such a facility using already existing infrastructure at CERN.
A NAS is the name given to dedicated Data Storage technology that can be connected directly to a Computer Network to provide centralized data access and storage to heterogeneous network clients. The Operating System and other software on the NAS unit provide only the functionality of data storage, data access and the management of these functionalities. Several file transfer protocols are supported (NFS, SMB, etc.).
By contrast to a SAN (Storage Area Network), NAS uses file based protocols where it is clear that the storage is remote, and Server nodes request a portion of an abstract file rather than a disk block. SAN is an architecture to attach remote computer storage devices, such as disk arrays and tape libraries, to servers in such a way that, to the Operating System, the devices appear as locally attached. SANs tend to be expensive and complex, which makes them uncommon outside larger enterprises. Moreover, when using SAN for Oracle Databases, it is recommended to use Oracle ASM (Automatic Storage Management). ASM adds an additional layer of complexity.

The main motivations to set up a NAS based infrastructure for Oracle Databases are:

- To provide the file sharing needed for Oracle RAC. NAS storage presents some advantages like the possibility of having remote storage or the fact that communication can be based on standard protocols such as NFS.
- To ease relocation of services within Server nodes.
- To use NAS specific features: Snapshots, RAID, Failover based on NAS partnership, Dynamic reSizing of file systems, Remote Sync to an offsite NAS, etc.
- To use Ethernet rather than Fibre Channel (which is more difficult to manage, as the protocol is more complex per se and extra parameterization is required, like defining zones, etc.).
- To ease File Server Management: automatic failure reporting to the vendor, etc.
- To simplify administration of Database storage (scalability, easiness to add/create/resize NAS volumes, etc.).

2. Topology
The topology is composed of several (HP DL380 G5) Server nodes running Red Hat Enterprise Linux 4. Each of them is equipped with two dual core 2.33 GHz Intel processors, 8 GB RAM and 6 NICs (Network Interface Cards). Oracle RAC requires a private network interface between RAC members. This is implemented by connecting all server nodes to 2 network switches. The 2 NICs connected to those switches are aggregated using Linux Bonding and are seen as a single network interface (with a single IP address).
In a similar way, each server node is connected to 2 network switches that reach the NAS filers, which are connected to two Disk Shelves. A fifth NIC of the server is connected to the GPN (General Purpose Network). In this topology there is no single point of failure other than the switch which connects all servers to the GPN. A failure of a server node is overcome by the Oracle RAC software, which redirects requests to another Server node within the same RAC.

Figure 1. NAS based infrastructure topology.

2.1. Bonding
Network bonding (also known as port trunking) consists of aggregating multiple network interfaces into a single logical bonded interface that corresponds to a single IP address. This technique allows implementation of load balancing (i.e. using multiple network ports in parallel to increase the link speed beyond the limits of a single port) and/or automatic failover (in the event of a network interface failure, data is transferred through the other network interfaces in the same Bonding aggregate). Two types of bonding modes are used in this topology:

- Active-backup mode: Only one NIC in the logical bonded interface is active. A different NIC becomes active if, and only if, the active NIC fails. The bond's MAC address is externally visible on only one port (network adapter) to avoid confusing the switch. The primary bonding option can be used to specify which NIC is the primary device (for example: primary=eth3). The specified device will always be the active NIC while it is available. Only when the primary device is offline will alternate devices be used. This is useful when one NIC is preferred over another, e.g., when one NIC has higher throughput than another. The link state of each NIC is monitored every 100 milliseconds.

- IEEE 802.3ad Dynamic link aggregation mode: Used for trunking between NAS filers and switches as well as for the connection between the switches. Each NAS filer is equipped with 8 NICs that are aggregated in 2 VIFs (Virtual Interfaces) of 4 NICs each. Each filer does load balancing on the network traffic transmitted over each VIF. This is done using the IP based method, i.e. the outgoing interface is selected on the basis of the NAS filer and client IP addresses. It is also possible to use MAC based or Round Robin methods instead.
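On RHEL 4, the active-backup bond described above can be sketched with the usual bonding driver configuration. Only mode=active-backup, miimon=100 and primary=eth3 come from the text; the device names and the private interconnect address below are illustrative:

```shell
# /etc/modprobe.conf -- load the bonding driver for the private interconnect
alias bond0 bonding
options bond0 mode=active-backup miimon=100 primary=eth3

# /etc/sysconfig/network-scripts/ifcfg-bond0 -- the single logical interface
DEVICE=bond0
IPADDR=192.168.1.10        # illustrative interconnect address
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth3 -- each physical NIC is enslaved
DEVICE=eth3
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

The second slave NIC gets an analogous ifcfg file; the bond state can be inspected at runtime in /proc/net/bonding/bond0.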
2.2. NetApp Cluster Configuration
Each NAS filer is configured in NetApp Cluster enabled mode, having as a partner another NAS filer which accesses the same set of Disk Shelves and the same subnet. NAS filers use a cluster interconnect to monitor each other. When a NAS filer fails, a takeover occurs, and the partner filer continues to serve the failed filer's data. Takeover operations can also be manually initiated. This allows performing non-disruptive NAS filer software upgrades as well as disk storage maintenance.
The 2 VIFs in each NAS filer are connected to different network switches. They are aggregated in a 2nd level VIF which acts in active-backup mode, i.e. only one of the 1st level VIFs is active.
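The two-level VIF arrangement can be expressed with Data ONTAP 7 vif commands along the following lines; only the LACP/IP based load balancing and the single (active-backup) second level come from the text, while interface and VIF names are illustrative:

```shell
# First-level VIFs: two 802.3ad (LACP) aggregates of 4 NICs each,
# with IP based load balancing (-b ip)
vif create lacp vif1 -b ip e0a e0b e0c e0d
vif create lacp vif2 -b ip e4a e4b e4c e4d

# Second-level VIF in single mode: only one first-level VIF active at a time,
# each connected to a different network switch
vif create single vif_top vif1 vif2
```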
3. NAS filer and Disk Shelves setup
Each pair of NAS filers is connected to a shared set of Disk Shelves via FC (Fibre Channel). Each shelf is equipped with 14 * 146 GB disks (or 14 * 300 GB disks). Data ONTAP 7.2.2 is the operating system used in all NAS filers.
3.1. Disk Aggregates and Volumes
A single aggregate is created in each shelf containing all of its disks (13 disks + 1 spare disk). It is composed of a single RAID-DP (RAID Double Parity) group. RAID-DP is NetApp's implementation of RAID 6 (it provides fault tolerance against two drive failures) that uses double parity for data protection.

Figure 2. Diagram of a RAID-DP (Double Parity).

RAID-DP improves standard RAID 6 performance due to the behavior of the storage controller software. All file system requests are first written to the battery backed NVRAM to ensure there is no data loss should the system lose power. Blocks are never updated in place, so when incoming write operations are performed, writes are aggregated and the storage controller tries to write only complete stripes including both parity blocks. RAID-DP provides better protection than RAID 1/0, and even enables disk firmware updates to occur in real time without any outage.
Each aggregate contains several FlexVol volumes (sometimes called Flexible Volumes). These are loosely coupled to their containing aggregate, as opposed to traditional volumes where each aggregate contained a single volume. Since a FlexVol volume is managed separately from the aggregate, it is possible to create small FlexVol volumes (20 MB or larger), and increase or decrease their size in increments as small as 4 KB.
A FlexVol volume can share its containing aggregate with other FlexVol volumes. Thus, a single aggregate can be the shared source of all the storage used by all the FlexVol volumes contained by that aggregate.
Each FlexVol volume is created with a small size and the autosize option, which allows automatic increase of the volume size as it fills up.
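The aggregate and volume layout above can be sketched with Data ONTAP 7 commands; the aggregate and volume names, the maximum size and the growth increment are illustrative, while the 13-disk RAID-DP group and the small initial size come from the text:

```shell
# One aggregate per shelf: a single RAID-DP group over 13 disks
# (the 14th disk of the shelf is kept as a spare)
aggr create aggr1 -t raid_dp 13

# Create a small FlexVol in that aggregate...
vol create dbs03 aggr1 20m

# ...and let it grow automatically up to a ceiling as data is written
vol autosize dbs03 -m 200g -i 1g on
```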
3.2. NFS access options
The NFS mount options used to mount the NAS filer volumes where Server nodes write Oracle Data are those suggested by the vendor for this kind of RAC configuration. These are:
mount -o rw,bg,hard,nointr,tcp,vers=3,actimeo=0,timeo=600,rsize=32768,wsize=32768
Using the bg option means that a Server node will be able to finish booting without waiting for any NAS Filer. The hard option minimizes the likelihood of data loss during network and server instability, while nointr does not allow file operations to be interrupted. The tcp option forces NFS to use the TCP protocol, which works well on many typical LANs with 32 KB read and write sizes. The timeo option sets the RPC retransmission timeout. Retransmission is the mechanism by which clients ensure a server receives and processes an RPC request. If the client does not receive a reply for an RPC within a certain interval for any reason, it retransmits the request until it receives a reply from the server. After each retransmission, the client doubles the retransmit timeout, up to 60 seconds, to keep network load to a minimum. Using timeo=600 is a good default for TCP mounts.
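Expressed as a persistent /etc/fstab entry on a Server node, the vendor-suggested options look like the following; the filer host name, volume path and mount point are illustrative:

```shell
# /etc/fstab -- NFS mount of an Oracle data volume with the options above
nasfiler1:/vol/dbs03  /ORA/dbs03  nfs  rw,bg,hard,nointr,tcp,vers=3,actimeo=0,timeo=600,rsize=32768,wsize=32768  0 0
```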

3.3. Backup
All Oracle Databases hosted in NAS filers, as well as the internal disks of server nodes, are backed up in CERN's central network backup system, TSM (IBM Tivoli Storage Manager). RMAN (Oracle Recovery Manager) is used to back up Oracle Databases using the TDPO (TSM Data Protection for Oracle) client. TSM and TDPO software are installed and configured using Quattor components.
Data ONTAP provides Snapshots of volumes, which are a read-only copy of an entire volume that protects against accidental deletions or modifications of files without duplicating file contents. This feature is used for backup purposes before Database maintenance operations and it is typically preceded by a Database shutdown.
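A pre-maintenance Snapshot reduces to a single Data ONTAP command once the Database has been shut down; the volume and snapshot names below are illustrative:

```shell
# Read-only, space-efficient copy of the whole volume before maintenance
snap create dbs03 before_maintenance

# List the snapshots of the volume to verify it was taken
snap list dbs03
```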
4. Performance
NetApp provides SAN as well as NAS based solutions. In their Performance Report (http://www.netapp.com/library/tr/3423.pdf), it is stated that the throughput of NFS NAS based solutions is slightly lower than that of iSCSI (Internet SCSI protocol) or FCP (Fibre Channel Protocol) based solutions.
Oracle Database performs very well with the NAS devices that have been tested and put into production. Many tests were run from the Database, in single instance mode and in cluster (RAC) mode.
4.1. Direct I/O
Oracle makes use of direct I/O with NFS to access Database files stored on Network Attached Storage devices. Direct I/O avoids external caching in the OS page cache. Moreover, it performs much better (for typical Oracle Database I/O workloads) than buffered I/O: tests showed that performance doubled when going from buffered I/O to direct I/O. The only step required to enable direct I/O at the Database level is to set the FILESYSTEMIO_OPTIONS parameter to directIO:

alter system set FILESYSTEMIO_OPTIONS=directIO scope=spfile sid='*';
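Since the parameter is set with scope=spfile, it only takes effect after an instance restart; it can then be checked from the command line, for example (a sketch, assuming a local sysdba connection):

```shell
# Verify the effective setting after the instance has been restarted
sqlplus -s / as sysdba <<'EOF'
show parameter filesystemio_options
EOF
```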
4.2. Measurements
A stress test program has been written that simulates the random "small I/O" (typically 8 KB read operations) performed by Oracle Database. All tests show that each disk in a NAS (test performed with a set of aggregates summing up to a total of 12 disks) provides a maximum of about 150 I/O operations per second per device (with 32 threads, see the Figure below). This matches the maximum number of I/O operations that can be performed on a single disk when doing random operations. The disks used were of the 10000 RPM Fibre Channel type.

Figure 3. Operations per second / Threads.
With more than 64 threads, the disks start saturating and the number of operations starts dropping.
5. Software Installation and Customization
All Server nodes are Quattor (http://cern.ch/quattor) managed. Quattor is a system administration toolkit providing a powerful, portable and modular tool suite for the automated installation, configuration and management of clusters and farms running UNIX derivatives like Linux and Solaris. All information regarding Server node software setup is stored in a central CDB (Configuration Database). The set of already available Quattor NCM components is enough to fully automate configuration (including setup for Network Bonding, NFS, Firewall, Kernel Modules, etc.) for everything regarding the Operating System layer. On the other hand, a lot of work has been done to automate installation and configuration of the Oracle RDBMS and RAC software.
5.1. Standard Database setup
For each Oracle RAC Database, four volumes are created in the NAS filer. Critical files are spread across those volumes. They are mounted in each RAC member on four mount points with the following content:

- /ORA/dbs00: logging directory structure and a copy of the control file
- /ORA/dbs02: copy of the control file, copy of the voting disk and the archived redo files
- /ORA/dbs03: the spfile, a copy of the control file, a copy of the voting disk, data files and a copy of the registry
- /ORA/dbs04: a copy of the control file, a copy of the voting disk, a copy of the registry. This volume is located in a different aggregate, i.e. a different NAS filer and Disk Shelf.

5.2. Oracle Software Installation
The Oracle Clusterware and RDBMS software were packaged in RPM format. The files included in the RPM are obtained using OUI (Oracle Universal Installer) for a first software installation, and the Oracle cloning script is used for distribution on the different RAC nodes. Due to Oracle installation procedure restrictions, these RPMs are only used for software installation and cannot serve for security checks or dependency verifications. Elapsed time for the fully automated RPM installation is a few minutes, compared to a good fraction of an hour using the interactive Oracle tools. Package Removal (AKA Uninstallation in Oracle terminology) is also much easier.
The work required to build RPMs starting from OUI is not negligible; it is comparable to a single manual interactive installation, including the application of all patch sets as well as isolated patches.
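Once the RPM is built, deployment on a RAC node reduces to standard package management; the package name below is hypothetical:

```shell
# Fully automated installation of the pre-built Oracle software RPM
# (takes a few minutes, versus a good fraction of an hour interactively)
rpm -ivh oracle-rdbms-10.2.0-1.x86_64.rpm

# Package removal ("Uninstallation") is equally simple
rpm -e oracle-rdbms
```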
5.3. Standard RAC Configuration
The Quattor CDB contains all information about what should be installed and configured on every RAC member, as well as the configuration information for each RAC instance. A Quattor component reads the CDB to obtain all this configuration information and then modifies the RAC configuration files and startup procedures accordingly. Finally, it starts the Clusterware and the RDBMS.
6. Monitoring
The Lemon Monitoring System (http://cern.ch/lemon) is used to monitor all aspects of the Server Nodes including resource usage, system errors, etc. No particular Lemon sensor has been developed for this infrastructure, since the already existing wide set of sensors covers most of what is needed.
Lemon is a server/client based monitoring system. On every monitored node, a monitoring agent launches and communicates, using a push/pull protocol, with sensors that are responsible for retrieving monitoring information. The extracted samples are stored in a local cache and forwarded to a central Measurement Repository using the UDP or TCP transport protocol, with or without authentication/encryption of the data samples. Sensors can collect information on behalf of remote entities like switches or power supplies. The Measurement Repository can interface to a relational Database or a flat file backend for storing the received samples. A Web based interface is provided for visualizing the data.
6.1. NAS Monitoring
NAS filers cannot be directly monitored using Lemon, since it is not possible to install any monitoring agent on them. On the other hand, OEM (Oracle Enterprise Manager) provides integrated OEM Connectors that use SNMP to monitor NAS filers from an OEM agent running on a server node. OEM monitors performance, raises alerts and provides history log information.

Figure 4. OEM Connector for NAS monitoring.

Additionally, Data ONTAP offers several AutoSupport options to allow automatic reporting of hardware problems to the Hardware Support Centre. This is done via SMTP (it can as well be done via HTTPS). Since NAS filers are not connected to the general purpose network, SMTP communication goes through one of the Server nodes.
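Enabling AutoSupport over SMTP amounts to setting a few Data ONTAP options on the filer; the relay host and recipient address below are illustrative:

```shell
# Enable AutoSupport and deliver reports via SMTP,
# relayed through one of the Server nodes
options autosupport.enable on
options autosupport.support.transport smtp
options autosupport.mailhost servernode1.cern.ch
options autosupport.to storage.admins@cern.ch
```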

7. Conclusion
Over 15 Oracle RAC instances are already working in production using this NAS based infrastructure. This includes RAC services for various CASTOR2 Stagers (for CMS, Atlas, Alice, LHCb, etc.) as well as other projects at CERN (Lemon, OEM) and some AIS services.
A set of hardware failure tests has been performed. Tests included:

- Powering off each network switch
- Powering off a NAS filer
- Powering off an Oracle RAC member node
- Unplugging the active bond member network cable of the interconnect
- Unplugging the active member VIF cable of a NAS filer
- Unplugging the FC cable connecting a NAS filer and a Disk shelf
- Unplugging the power cable of a NAS filer
- Unplugging the power cable of a Disk Shelf
- Removing a Disk from a shelf

All these tests passed successfully and did not cause any service interruption. Also, several maintenance operations were performed causing no downtime:

- Failed Disk replacement
- NAS filer HA (Host Adapter) replacement
- Filer OS update: Data ONTAP upgrade to 7.2.2 (rolling forward)
- Failed Power Supply Unit replacement (both in Server node and NAS Filer)
- Ethernet Network cable replacement
- Server Node OS reinstallation
- Server Node Kernel upgrade

As a result of this very positive experience, the NAS based infrastructure will be extended with the objective of basing all Oracle RAC installations on the same technology. In this way, IT/DES hopes to drastically reduce the burden of having to administrate and maintain the current, rather fragmented setup based on a variety of suppliers and storage technologies (EMC, Sun Storage, etc.). Moreover, this solution avoids the complexity present in other solutions like SAN systems, which require installing Oracle ASM or using a volume manager.
Acknowledgements
The authors would like to thank the whole IT/DES group for their involvement in the design and setup of this NAS based Database infrastructure. We would like to particularly thank Eric Grancher, Nilo Segura Chinchilla, Artur Wiecek, Mats Möller, Johan Gudheim Hansen and Philippe Defert for providing ideas and feedback on this project.
