
ETL Process in Data Warehouse

G.Lakshmi Priya & Razia Sultana.A


Assistant Professor/IT

Outline

ETL
Extraction
Transformation
Loading

ETL Overview

Extraction, Transformation, Loading (ETL)
To get data out of the source and load it into the data warehouse; simply a process of copying data from one database to another
Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database
Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading
When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation
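As a minimal sketch of the extract, transform, load sequence described above, the following Python example copies rows from a stand-in OLTP table into a differently shaped warehouse table. The table names, column names, and the use of in-memory SQLite are assumptions made for illustration only; they do not come from the slides.

```python
import sqlite3

# Hypothetical OLTP source with one orders table (names are illustrative).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " alice ", 1250), (2, "BOB", 990)],
)

# Hypothetical warehouse whose target table has a different shape.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, customer_name TEXT, amount REAL)"
)


def extract(conn):
    """Read raw rows from the source system."""
    return conn.execute("SELECT id, customer, amount_cents FROM orders").fetchall()


def transform(rows):
    """Reshape rows to match the warehouse schema (clean names, cents to units)."""
    return [(oid, name.strip().title(), cents / 100) for oid, name, cents in rows]


def load(conn, rows):
    """Write the transformed rows into the warehouse table."""
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


load(warehouse, transform(extract(source)))
loaded = warehouse.execute("SELECT * FROM fact_orders ORDER BY order_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the point that ETL is a process: each step can be replaced by a different physical implementation without changing the overall flow.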

ETL Overview
ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers, and application developers
It is not a one-time event, as new data is added to the data warehouse periodically: monthly, daily, hourly
Because ETL is an integral, ongoing, and recurring part of a data warehouse, it should be:
Automated
Well documented
Easily changeable

ETL Staging Database

ETL operations should be performed on a relational database server separate from the source databases and the data warehouse database
Creates a logical and physical separation between the source systems and the data warehouse
Minimizes the impact of the intense periodic ETL activity on source and data warehouse databases

Extraction

Extraction

The integration of all of the disparate systems across the enterprise is the real challenge in getting the data warehouse to a state where it is usable
Data is extracted from heterogeneous data sources
Each data source has its distinct set of characteristics that need to be managed and integrated into the ETL system in order to effectively extract data.

Extraction

The ETL process needs to effectively integrate systems that have different:

DBMS
Operating systems
Hardware
Communication protocols

Need to have a logical data map before the physical data can be transformed

The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system, usually presented in a table or spreadsheet

Target                                | Source                                | Transformation
Table Name | Column Name | Data Type  | Table Name | Column Name | Data Type  |

The content of the logical data mapping document has been proven to be the critical element required to efficiently plan ETL processes

The table type gives us our cue for the ordinal position of our data load processes: first dimensions, then facts.

The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process

The transformation can contain anything from the absolute solution to nothing at all. Most often, the transformation can be expressed in SQL. The SQL may or may not be the complete statement
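One way to picture a logical data map is as a list of records, one per target column, where the transformation field holds an SQL expression. The sketch below is illustrative only: the `MapEntry` class, the `table_type` field values, and all table and column names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    target_table: str
    target_column: str
    target_type: str
    table_type: str      # "dimension" or "fact"; used to order the loads
    source_table: str
    source_column: str
    source_type: str
    transformation: str  # SQL expression; may be just the source column


# Illustrative entries; every name here is made up for the example.
data_map = [
    MapEntry("fact_sales", "amount", "REAL", "fact",
             "orders", "amount_cents", "INTEGER", "amount_cents / 100.0"),
    MapEntry("dim_customer", "name", "TEXT", "dimension",
             "customers", "cust_name", "TEXT", "TRIM(cust_name)"),
]

LOAD_ORDER = {"dimension": 0, "fact": 1}


def build_select(entry):
    """Turn one map entry into a SELECT statement for the ETL developer."""
    return (f"SELECT {entry.transformation} AS {entry.target_column} "
            f"FROM {entry.source_table}")


# Dimensions are processed first, then facts.
ordered = sorted(data_map, key=lambda e: LOAD_ORDER[e.table_type])
statements = [build_select(e) for e in ordered]
for statement in statements:
    print(statement)
```

In practice the map usually lives in a spreadsheet; holding it as data simply makes the load order and the SQL fragments easy to derive and review.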

The analysis of the source system is usually broken into two major phases:
The data discovery phase
The anomaly detection phase

Extraction: Data Discovery Phase
A key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it
Once you understand what the target needs to look like, you need to identify and examine the data sources

Data Discovery Phase

It is up to the ETL team to drill down further into the data requirements to determine each and every source system, table, and attribute required to load the data warehouse
Collecting and documenting source systems
Keeping track of source systems
Determining the system of record: the point of origin of the data
The definition of the system of record is important because in most enterprises data is stored redundantly across many different systems.
Enterprises do this to make nonintegrated systems share data. It is very common for the same piece of data to be copied, moved, manipulated, transformed, altered, cleansed, or corrupted throughout the enterprise, resulting in varying versions of the same data

Extraction: Data Content Analysis

Understanding the content of the data is crucial for determining the best approach for retrieval
NULL values. An unhandled NULL value can destroy any ETL process. NULL values pose the biggest risk when they are in foreign key columns. Joining two or more tables based on a column that contains NULL values will cause data loss! Remember, in a relational database NULL is not equal to NULL. That is why those joins fail. Check for NULL values in every foreign key in the source database. When NULL values are present, you must outer join the tables
Dates in non-date fields. Dates are very peculiar elements because they are the only logical elements that can come in various formats, literally containing different values and having the exact same meaning. Fortunately, most database systems support most of the various formats for display purposes but store them in a single standard format
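The NULL foreign key problem can be reproduced in a few lines. This sketch uses in-memory SQLite with two made-up tables (`employee`, `department`); it first counts NULL foreign keys, then compares an inner join with an outer join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, emp_name TEXT, dept_id INTEGER);
    INSERT INTO department VALUES (10, 'Sales');
    INSERT INTO employee VALUES (1, 'Ann', 10), (2, 'Raj', NULL);
""")

# Step 1: check the foreign key column for NULL values.
null_fk_count = conn.execute(
    "SELECT COUNT(*) FROM employee WHERE dept_id IS NULL"
).fetchone()[0]

# Step 2: an inner join silently drops the employee with the NULL foreign key.
inner_rows = conn.execute("""
    SELECT e.emp_name, d.dept_name
    FROM employee e JOIN department d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
""").fetchall()

# Step 3: an outer join keeps the row; the missing department gets a placeholder.
outer_rows = conn.execute("""
    SELECT e.emp_name, COALESCE(d.dept_name, 'Unknown')
    FROM employee e LEFT OUTER JOIN department d ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
""").fetchall()

print(null_fk_count, inner_rows, outer_rows)
```

The 'Unknown' placeholder is one possible convention, assumed here for illustration; what to substitute for a missing parent is a design decision for the ETL team.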

During the initial load, capturing changes to data content in the source data is unimportant because you are most likely extracting the entire data source or a portion of it from a predetermined point in time.
Later, the ability to capture data changes in the source system instantly becomes a priority
The ETL team is responsible for capturing data content changes during the incremental load.

Determining Changed Data

Audit Columns: used by the database and updated by triggers
Audit columns are appended to the end of each table to store the date and time a record was added or modified
You must analyze and test each of the columns to ensure that it is a reliable source to indicate changed data. If you find any NULL values, you must find an alternative approach for detecting change, for example using outer joins
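A minimal sketch of change detection with an audit column follows. It assumes a hypothetical `customer` table with a `last_modified` column stored as ISO-8601 text, and a remembered timestamp of the previous extract; both are illustrative choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT);
    INSERT INTO customer VALUES
        (1, 'Ann', '2024-01-01 08:00:00'),
        (2, 'Raj', '2024-01-03 09:30:00'),
        (3, 'Lee', NULL);
""")

# Timestamp of the previous successful extract (assumed to be stored by the ETL job).
last_extract = "2024-01-02 00:00:00"

# Reliability test: any NULL in the audit column means it cannot be trusted alone.
null_audit = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE last_modified IS NULL"
).fetchone()[0]

# Rows the audit column reports as changed since the last extract.
changed = conn.execute(
    "SELECT id, name FROM customer WHERE last_modified > ? ORDER BY id",
    (last_extract,),
).fetchall()

print(null_audit, changed)
```

Row 3 has a NULL audit value, so the timestamp filter never selects it no matter when it changed. That is the situation in which another change detection technique is needed.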

Determining Changed Data

Process of Elimination
Process of elimination preserves exactly one copy of each previous extraction in the staging area for future use.
During the next run, the process takes the entire source table(s) into the staging area and makes a comparison against the retained data from the last process.
Only differences (deltas) are sent to the data warehouse.
Not the most efficient technique, but the most reliable for capturing changed data
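The comparison step can be sketched in plain Python by treating each extract as a dictionary keyed by natural key. The function name `diff_snapshots` and the sample data are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two full extracts keyed by natural key and return only the deltas."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deleted = [k for k in previous if k not in current]
    return inserted, updated, deleted


# Retained copy of the last extraction and the full extract from this run.
previous_extract = {101: ("Ann", "Chennai"), 102: ("Raj", "Delhi"), 103: ("Lee", "Pune")}
current_extract = {101: ("Ann", "Chennai"), 102: ("Raj", "Mumbai"), 104: ("Mia", "Goa")}

inserted, updated, deleted = diff_snapshots(previous_extract, current_extract)
print(inserted, updated, deleted)
```

Because every row is compared, no change is missed, deletions included; the cost is reading the whole source on every run.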

Determining Changed Data


Initial and Incremental Loads
Create two tables: previous load and current load.
The initial process bulk loads into the current load table. Since change detection is irrelevant during the initial load, the data continues on to be transformed and loaded into the ultimate target fact table.
When the process is complete, it drops the previous load table, renames the current load table to previous load, and creates an empty current load table. Since none of these tasks involve database logging, they are very fast!
The next time the load process is run, the current load table is populated.
Select the current load table MINUS the previous load table. Transform and load the result set into the data warehouse.
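The two-table technique can be sketched with in-memory SQLite. SQLite spells the set difference operator `EXCEPT` where Oracle uses `MINUS`; the table names `previous_load` and `current_load` and the sample rows are assumptions. The remark about logging is DBMS-specific and is not demonstrated by SQLite here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE previous_load (id INTEGER, city TEXT);
    CREATE TABLE current_load (id INTEGER, city TEXT);
    INSERT INTO previous_load VALUES (1, 'Chennai'), (2, 'Delhi');
    INSERT INTO current_load VALUES (1, 'Chennai'), (2, 'Mumbai'), (3, 'Goa');
""")

# Current load MINUS previous load: new rows plus rows whose content changed.
deltas = conn.execute("""
    SELECT id, city FROM current_load
    EXCEPT
    SELECT id, city FROM previous_load
    ORDER BY id
""").fetchall()
print(deltas)

# After the deltas are transformed and loaded, rotate the tables for the next run.
conn.executescript("""
    DROP TABLE previous_load;
    ALTER TABLE current_load RENAME TO previous_load;
    CREATE TABLE current_load (id INTEGER, city TEXT);
""")

previous_count = conn.execute("SELECT COUNT(*) FROM previous_load").fetchone()[0]
current_count = conn.execute("SELECT COUNT(*) FROM current_load").fetchone()[0]
```

Note that the set difference reports inserts and updates only; rows deleted at the source would need the reverse comparison (previous minus current).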

Transformation

Transformation

Main step where the ETL adds value
Actually changes data and provides guidance on whether data can be used for its intended purposes
Performed in the staging area

Transformation
Data Quality paradigm
Correct
Unambiguous
Consistent
Complete
Data quality checks are run at two places: after extraction, and after cleaning and conforming (additional checks are run at this point)

Transformation: Cleaning Data

Anomaly Detection
Data sampling: count(*) of the rows for a department column
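A minimal sketch of data sampling by row count follows, assuming a hypothetical staged table `staged_employee` with a `department` column. Grouping on the column and counting rows makes unexpected values stand out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staged_employee (emp_id INTEGER, department TEXT);
    INSERT INTO staged_employee VALUES
        (1, 'Sales'), (2, 'Sales'), (3, 'Sales'), (4, 'sales'), (5, NULL);
""")

# Count rows per distinct department value; rare values deserve a closer look.
profile = conn.execute("""
    SELECT department, COUNT(*) AS row_count
    FROM staged_employee
    GROUP BY department
    ORDER BY row_count DESC
""").fetchall()

for department, row_count in profile:
    print(department, row_count)
```

Here the lowercase 'sales' and the NULL each appear once next to three 'Sales' rows, which flags them as candidates for cleaning.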

Column Property Enforcement

NULL values in required columns
Numeric values that fall outside of expected highs and lows
Columns whose lengths are exceptionally short or long
Columns with certain values outside of discrete valid value sets
Adherence to a required pattern, or membership in a set of patterns
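The column property checks listed above can be sketched as one function that returns the violations found in a staged row. The column names, the limits, the valid status set, and the phone pattern are all invented for the example.

```python
import re

VALID_STATUS = {"active", "inactive"}        # discrete valid value set (assumed)
PHONE_PATTERN = re.compile(r"\d{3}-\d{4}")   # required pattern (assumed)


def check_row(row):
    """Return a list of column-property violations for one staged row."""
    errors = []
    name = row.get("name")
    if name is None:
        errors.append("name: NULL in required column")
    elif not 2 <= len(name) <= 50:
        errors.append("name: length exceptionally short or long")
    age = row.get("age")
    if age is not None and not 0 <= age <= 120:
        errors.append("age: outside expected range")
    if row.get("status") not in VALID_STATUS:
        errors.append("status: not in valid value set")
    phone = row.get("phone")
    if phone is not None and not PHONE_PATTERN.fullmatch(phone):
        errors.append("phone: does not match required pattern")
    return errors


rows = [
    {"name": "Ann", "age": 34, "status": "active", "phone": "555-1234"},
    {"name": None, "age": 150, "status": "retired", "phone": "5551234"},
]
report = [check_row(r) for r in rows]
print(report)
```

Returning a list of messages instead of raising on the first problem lets the cleaning step record every violation for a row before deciding what to do with it.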

Transformation: Conforming

Structure Enforcement
Tables have proper primary and foreign keys
Obey referential integrity

Data and Rule Value Enforcement
Simple business rules
Logical data checks
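Structure and rule checks of this kind can be expressed as queries against the staged tables. The sketch below assumes two hypothetical staging tables, `stg_customer` and `stg_order`, and an invented business rule that order amounts may not be negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (customer_id INTEGER, name TEXT);
    CREATE TABLE stg_order (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO stg_customer VALUES (1, 'Ann'), (2, 'Raj'), (2, 'Raj duplicate');
    INSERT INTO stg_order VALUES (100, 1, 25.0), (101, 9, 40.0), (102, 2, -5.0);
""")

# Primary key check: a key value must not occur more than once.
duplicate_keys = conn.execute("""
    SELECT customer_id FROM stg_customer
    GROUP BY customer_id HAVING COUNT(*) > 1
""").fetchall()

# Referential integrity check: every order must point at an existing customer.
orphans = conn.execute("""
    SELECT o.order_id
    FROM stg_order o
    LEFT JOIN stg_customer c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()

# Simple business rule (assumed): amounts may not be negative.
negative_amounts = conn.execute(
    "SELECT order_id FROM stg_order WHERE amount < 0"
).fetchall()

print(duplicate_keys, orphans, negative_amounts)
```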

[Figure: flow of staged data. Staged Data -> Cleaning and Conforming -> Fatal Errors? Yes: Stop. No: Loading.]

Loading

Loading Dimensions
Loading Facts

Loading Dimensions
Physically built to have the minimal set of components
The primary key is a single field containing a meaningless unique integer (surrogate key)
The DW owns these keys and never allows any other entity to assign them
Denormalized flat tables: all attributes in a dimension must take on a single value in the presence of a dimension primary key.
Should possess one or more other fields that compose the natural key of the dimension

The data loading module consists of all the steps required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format with correct primary keys, correct natural keys, and final descriptive attributes.
Creating and assigning the surrogate keys occurs in this module.
The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse.

Loading Dimensions

When the DW receives notification that an existing row in a dimension has changed, it gives one of three types of response:
Type 1
Type 2
Type 3

Type 1 Dimension

Type 2 Dimension

Type 3 Dimension
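The three slide titles above carry no detail in this text, so the following sketch relies on the commonly described behavior of each response: Type 1 overwrites the attribute, Type 2 adds a new row with a new surrogate key, and Type 3 keeps the prior value in an extra column. The row layout, function names, and sample values are hypothetical.

```python
from datetime import date


def apply_type1(rows, natural_key, attr, new_value):
    """Type 1: overwrite the attribute in place; history is lost."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row[attr] = new_value


def apply_type2(rows, natural_key, attr, new_value, change_date):
    """Type 2: expire the current row and append a new row with a new surrogate key."""
    current = next(r for r in rows
                   if r["natural_key"] == natural_key and r["is_current"])
    current["is_current"] = False
    current["end_date"] = change_date
    new_row = dict(current)
    new_row[attr] = new_value
    new_row.update(
        surrogate_key=max(r["surrogate_key"] for r in rows) + 1,
        start_date=change_date, end_date=None, is_current=True,
    )
    rows.append(new_row)


def apply_type3(rows, natural_key, attr, new_value):
    """Type 3: keep the prior value in a 'previous_<attr>' column on the same row."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row["previous_" + attr] = row[attr]
            row[attr] = new_value


def make_dim():
    """A one-row customer dimension used to try each response."""
    return [{"surrogate_key": 1, "natural_key": "C-17", "city": "Delhi",
             "start_date": date(2020, 1, 1), "end_date": None, "is_current": True}]


t1 = make_dim()
apply_type1(t1, "C-17", "city", "Mumbai")
t2 = make_dim()
apply_type2(t2, "C-17", "city", "Mumbai", date(2024, 6, 1))
t3 = make_dim()
apply_type3(t3, "C-17", "city", "Mumbai")
```

After these calls, `t1` still has one row with the new city, `t2` has two rows (the expired original and the current one), and `t3` has one row holding both the new and the previous city.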

Loading Facts

Facts
Fact tables hold the measurements of an enterprise. The relationship between fact tables and measurements is extremely simple. If a measurement exists, it can be modeled as a fact table row. If a fact table row exists, it is a measurement

Key Building Process: Facts
When building a fact table, the final ETL step is converting the natural keys in the new input records into the correct, contemporary surrogate keys
ETL maintains a special surrogate key lookup table for each dimension. This table is updated whenever a new dimension entity is created and whenever a Type 2 change occurs on an existing dimension entity
All of the required lookup tables should be pinned in memory so that they can be randomly accessed as each incoming fact record presents its natural keys. This is one of the reasons for making the lookup tables separate from the original data warehouse dimension tables.
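A minimal sketch of the key building step follows, using Python dictionaries as the in-memory lookup tables (natural key to current surrogate key). The lookup contents, field names, and the helper `on_type2_change` are hypothetical.

```python
# In-memory lookup tables, one per dimension: natural key -> current surrogate key.
customer_lookup = {"C-17": 2, "C-42": 3}
product_lookup = {"SKU-9": 11}


def build_fact_keys(record):
    """Replace the natural keys of an incoming fact record with surrogate keys."""
    return {
        "customer_key": customer_lookup[record["customer_id"]],
        "product_key": product_lookup[record["sku"]],
        "amount": record["amount"],
    }


def on_type2_change(lookup, natural_key, new_surrogate_key):
    """Point the natural key at the newest surrogate key after a Type 2 change."""
    lookup[natural_key] = new_surrogate_key


incoming = {"customer_id": "C-17", "sku": "SKU-9", "amount": 19.5}
fact_row = build_fact_keys(incoming)

# A Type 2 change on customer C-17 creates surrogate key 7; later facts use it.
on_type2_change(customer_lookup, "C-17", 7)
later_row = build_fact_keys(incoming)
print(fact_row, later_row)
```

A natural key missing from a lookup raises `KeyError` in this sketch; how to treat such facts (reject, hold, or map to a placeholder dimension row) is a separate design decision.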

Key Building Process

Loading Fact Tables

Managing Indexes
Indexes are performance killers at load time
Drop all indexes in pre-load time
Segregate updates from inserts
Load updates
Rebuild indexes
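The index handling steps can be sketched with in-memory SQLite as follows. The table `fact_sales`, the index `idx_sales_date`, and the sample rows are assumptions, and at this tiny scale the sketch only shows the order of operations, not a performance gain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, date_key INTEGER, amount REAL);
    CREATE INDEX idx_sales_date ON fact_sales (date_key);
    INSERT INTO fact_sales VALUES (1, 20240101, 10.0), (2, 20240101, 20.0);
""")

# Incoming records, already segregated into updates and inserts.
updates = [(12.5, 1)]                                  # (new amount, sale_id)
inserts = [(3, 20240102, 30.0), (4, 20240102, 40.0)]   # full new rows

# Drop the index before loading so it is not maintained row by row.
conn.execute("DROP INDEX idx_sales_date")

# Load the updates, then the inserts.
conn.executemany("UPDATE fact_sales SET amount = ? WHERE sale_id = ?", updates)
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", inserts)

# Rebuild the index once, after all rows are in place.
conn.execute("CREATE INDEX idx_sales_date ON fact_sales (date_key)")
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
index_names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(row_count, index_names)
```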

Managing Partitions
Partitions allow a table (and its indexes) to be physically divided into mini-tables for administrative purposes and to improve query performance
The most common partitioning strategy on fact tables is to partition the table by the date key. Because the date dimension is preloaded and static, you know exactly what the surrogate keys are
Need to partition the fact table on the key that joins to the date dimension for the optimizer to recognize the constraint.
The ETL team must be advised of any table partitions that need to be maintained.

Outwitting the Rollback Log
The rollback log, also known as the redo log, is invaluable in transaction (OLTP) systems. But in a data warehouse environment where all transactions are managed by the ETL process, the rollback log is a superfluous feature that must be dealt with to achieve optimal load performance. Reasons why the data warehouse does not need rollback logging are:

All data is entered by a managed process: the ETL system.
Data is loaded in bulk.
Data can easily be reloaded if a load process fails.

Each database management system has different logging features and manages its rollback log differently
