OCDQ Blog
Obsessive-Compulsive Data Quality by Jim Harris

Adventures in Data Profiling (Part 1)
August 03, 2009 Jim Harris

Obsessive-Compulsive Data Quality (OCDQ) is a blog offering a
vendor-neutral perspective on data quality and its related disciplines.
Jim Harris is the OCDQ Blogger-in-Chief and a freelance writer,
professional speaker, thought leader, and independent consultant.

In my popular post Getting Your Data Freq On, I explained that
understanding your data is essential to using it effectively and
improving its quality, and that to achieve these goals, there is simply
no substitute for data analysis.

I explained the benefits of using a data profiling tool to help automate
some of the grunt work, but that you need to perform the actual
analysis and then prepare meaningful questions and reports to share
with the rest of your team.

Series Overview

This post is the beginning of a vendor-neutral series on the
methodology of data profiling.

In order to narrow the scope of the series, the scenario used will be
that a customer data source for a new data quality initiative has been
made available to an external consultant who has no prior knowledge of
the data or its expected characteristics. Also, the business
requirements have not yet been documented, and the subject matter
experts are not currently available.
http://www.ocdqblog.com/home/adventuresindataprofilingpart1.html 1/10
12/29/2016 AdventuresinDataProfiling(Part1)OCDQBlog
The series will not attempt to cover every possible feature of a data
profiling tool or even every possible use of the features that are
covered. Both the data profiling tool and the data used throughout the
series will be fictional. The screen shots have been customized to
illustrate concepts and are not modeled after any particular data
profiling tool.

The Adventures Begin...

The customer data source has been processed by a data profiling tool,
which has provided the above counts and percentages that summarize
the following field content characteristics:

NULL: count of the number of records with a NULL value
Missing: count of the number of records with a missing value
  (i.e. non-NULL absence of data, e.g. character spaces)
Actual: count of the number of records with an actual value
  (i.e. non-NULL and non-missing)
Completeness: percentage calculated as Actual divided by the
  total number of records
Cardinality: count of the number of distinct actual values
Uniqueness: percentage calculated as Cardinality divided by the
  total number of records
Distinctness: percentage calculated as Cardinality divided by
  Actual
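
To make the definitions above concrete, here is a minimal Python sketch
of how the seven statistics could be computed for one field. It is not
modeled after any particular data profiling tool. The names FieldProfile
and profile_field are hypothetical, and treating None as NULL and an
empty or all-space string as missing is an assumption made for
illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FieldProfile:
    null: int            # records with a NULL value
    missing: int         # non-NULL records that hold no data (e.g. spaces)
    actual: int          # records that are neither NULL nor missing
    cardinality: int     # distinct actual values
    completeness: float  # Actual / total records, as a percentage
    uniqueness: float    # Cardinality / total records, as a percentage
    distinctness: float  # Cardinality / Actual, as a percentage


def profile_field(values: Sequence[Optional[str]]) -> FieldProfile:
    """Summarize one field. Assumes None is NULL and blank strings are missing."""
    total = len(values)
    null = sum(1 for v in values if v is None)
    missing = sum(1 for v in values if v is not None and v.strip() == "")
    # Actual values are compared as-is (no trimming or case folding).
    actual_values = [v for v in values if v is not None and v.strip() != ""]
    actual = len(actual_values)
    cardinality = len(set(actual_values))
    return FieldProfile(
        null,
        missing,
        actual,
        cardinality,
        completeness=100.0 * actual / total if total else 0.0,
        uniqueness=100.0 * cardinality / total if total else 0.0,
        distinctness=100.0 * cardinality / actual if actual else 0.0,
    )


# Hypothetical Gender Code values: 6 records, 1 NULL, 1 missing.
sample = profile_field(["M", "F", None, "  ", "M", "U"])
```

For the six sample records, Actual is 4 and Cardinality is 3, so
Uniqueness is 50% while Distinctness is 75%. The two percentages differ
whenever a field is not fully populated.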

Some initial questions based on your analysis of these statistical
summaries might include the following:

1. Is Customer ID the primary key for this data source?
2. Is Customer Name 1 the primary name on the account? If so, why
   isn't it always populated?
3. Do the statistics for Account Number and/or Tax ID indicate the
   presence of potential duplicate records?
4. Why does the Gender Code field have 8 distinct values?
5. Do the 5 distinct values in Country Code indicate international
   postal addresses?
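
Questions 1 and 3 can be probed mechanically before anyone is available
to answer them. The sketch below is only an illustration under stated
assumptions: the function names and sample values are hypothetical, and
None or a blank string stands in for NULL and missing values. Passing
the first check makes a field a candidate key, nothing more. Whether it
is the primary key remains a question for the business requirements.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def is_blank(value: Optional[str]) -> bool:
    """True for NULL (None) or a missing value (empty or only spaces)."""
    return value is None or value.strip() == ""


def is_key_candidate(values: Sequence[Optional[str]]) -> bool:
    """A key candidate needs an actual value in every record
    (100% completeness) and no repeated value (100% uniqueness)."""
    if not values or any(is_blank(v) for v in values):
        return False
    return len(set(values)) == len(values)


def repeated_values(values: Sequence[Optional[str]]) -> Dict[str, int]:
    """Actual values that occur more than once, with their counts."""
    counts = Counter(v for v in values if not is_blank(v))
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical sample data for two fields.
customer_ids = ["C001", "C002", "C003"]
tax_ids = ["111", "222", "111", None]

candidate = is_key_candidate(customer_ids)  # every ID present and distinct
suspects = repeated_values(tax_ids)         # values shared by several records
```

A repeated Tax ID only marks records worth reviewing. Two accounts can
legitimately share one, so the counts raise the duplicate question
rather than settle it.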

Please remember the series scenario: You are an external consultant
with no prior knowledge of the data or its expected characteristics,
who is performing this analysis without the aid of either business
requirements or subject matter experts.

What other questions can you think of based on analyzing the
statistical summaries provided by the data profiling tool?

In Part 2 of this series: We will continue the adventures by
attempting to answer these questions (and more) by beginning our
analysis of the frequency distributions of the unique values and
formats found within the fields. Additionally, we will begin using
drill-down analysis in order to perform a more detailed review of
records of interest.
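
As a rough preview of those two techniques, the following Python sketch
builds a frequency distribution for one field and then drills down to
the records behind a single value. The record layout, field names, and
function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def frequency_distribution(
    records: List[Record], field: str
) -> List[Tuple[Optional[str], int]]:
    """Count how often each value of `field` occurs, most frequent first."""
    return Counter(r.get(field) for r in records).most_common()


def drill_down(
    records: List[Record], field: str, value: Optional[str]
) -> List[Record]:
    """Return the full records behind one entry of the distribution."""
    return [r for r in records if r.get(field) == value]


# Hypothetical customer records.
customers = [
    {"Customer ID": "C001", "Gender Code": "M"},
    {"Customer ID": "C002", "Gender Code": "F"},
    {"Customer ID": "C003", "Gender Code": "M"},
    {"Customer ID": "C004", "Gender Code": "X"},
]

distribution = frequency_distribution(customers, "Gender Code")
odd_records = drill_down(customers, "Gender Code", "X")
```

The distribution shows which values exist and how common each one is.
The drill-down returns whole records, so a surprising value can be
reviewed alongside the other fields of the same customer.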

Related Posts

Adventures in Data Profiling (Part 2)
Adventures in Data Profiling (Part 3)
Adventures in Data Profiling (Part 4)
Adventures in Data Profiling (Part 5)
Adventures in Data Profiling (Part 6)
Adventures in Data Profiling (Part 7)
Getting Your Data Freq On

Data Quality, Methodology
Adventure In Data Profiling, Data Profiling

8 Comments

Jim Harris 7 years ago

Thanks for the additional questions Toby and Phil,

Good eye for detail, Toby. I agree that something potentially odd
seems to be happening in our State Abbreviation field. An
upcoming part in the series will be drilling down to those 72
distinct values and hopefully we will be able to answer all of
your great questions. Stay tuned.

Great question Phil: front-end validation (or some other
upstream process before data reaches wherever we happen to be
in the data's chain of custody) that replaces invalid values with
NULL is a common approach to cleansing data.

However, as Bill Inmon recently discussed in his BeyeNetwork
article Some Perspectives on Quality: keeping incorrect data
even when it is known to be incorrect often provides valuable
clues as to how data needs to be corrected.

Therefore, even when substituting NULL for invalid values is
the business rule that gets implemented, I advocate at least
tracking what the invalid value was for both auditing purposes
and for sending an upstream request to prevent it from
happening or otherwise resolve the issue before it becomes a
downstream problem.

However, as Bill Inmon's article also points out, there are
multiple perspectives on what the best approach is and, as with
so many other data issues, there is no one right answer.

As for the series, within the context of the scenario, we wouldn't
know, but I will break my own rules (just a little) and tell you
that no pre-cleansing has occurred on any of the fields in our
data source.

However, that doesn't mean that all is necessarily well with the
Birth Date field :)

Best Regards
Jim
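
One way to picture the tracking approach described in the comment above
is the following Python sketch, which replaces invalid Birth Date values
with NULL while recording each rejected value in an audit list. Every
detail here is an assumption made for illustration: the YYYY-MM-DD date
format, the function names, and the shape of the audit entries are
hypothetical and not taken from any real system.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def is_valid_date(text: str, fmt: str = "%Y-%m-%d") -> bool:
    """True if `text` parses as a real calendar date in the assumed format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def null_invalid_dates(
    records: List[Record], field: str
) -> Tuple[List[Record], List[dict]]:
    """Replace invalid dates with None, but keep what was rejected."""
    cleansed: List[Record] = []
    audit: List[dict] = []
    for row_number, record in enumerate(records):
        value = record.get(field)
        if value is not None and not is_valid_date(value):
            # Keep the original value for auditing and upstream follow-up.
            audit.append({"row": row_number, "field": field, "rejected": value})
            record = {**record, field: None}  # copy; the input is not mutated
        cleansed.append(record)
    return cleansed, audit


# Hypothetical records: the first date does not exist on the calendar.
source = [
    {"Birth Date": "1975-02-30"},
    {"Birth Date": "1980-06-15"},
    {"Birth Date": None},
]
cleansed, audit = null_invalid_dates(source, "Birth Date")
```

The audit list preserves the clue that the NULL alone would erase: the
rejected value itself, which can be sent upstream with the request to
fix the process that produced it.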

Phil A 7 years ago

Jim,

Thanks for the article, looking forward to the rest :)

My question would be about the treatment of the data BEFORE
it gets to you.

For example, has Birth Date already been validated by the
front-end system, thereby rendering invalid dates as null? Hence the
figure we see is not 100% complete but yet none are missing?

Toby 7 years ago

Additional questions:

Why are there 72 distinct values for State Abbreviation?
Does this indicate data quality problems? Non-US addresses?
Misnamed column? Overloading?
What values are found in State Abbreviation?

Jim Harris 7 years ago

Thanks for your feedback and questions Vish, both are greatly
appreciated.

I plan to post 2 parts per week. However, I might stretch it out a
little on a few of the parts because I want to leave enough time
between posts to allow for comments. This is one of the primary
reasons that I am doing the series via a blog and not a
whitepaper or a presentation. I feel that the dialogue and
discussion greatly improve the content and allow some of my
points to be made for me by those who take the time to comment
as opposed to me just telling everyone how I see the world.

All of your questions are excellent. I could answer them all right
now but it would defeat the goal of the series.

Remember the scenario: we are pretending to be an external
consultant with no prior knowledge of the data or its expected
characteristics. Obviously this is far easier for you and the rest
of my readers since you truly don't have any knowledge of the
data that I am using.

Many of your questions would need to be posed to those writing
the business requirements or the subject matter experts for this
data source. But again, in the series scenario, we do not have
access to any of these resources.

Therefore, what the series is really asking is: What can just our
analysis of the data tell us about it?

Data profiling at this stage of the process is often more about
formulating a growing list of meaningful questions than about
finding answers on our own. I would argue that this is not a
waste of our time and in fact, is a necessary aspect of performing
comprehensive data analysis.

I have been on many projects where I was told that I should
delay data profiling until business requirements and subject
matter experts are available.

I always disagree because I think that I can do a better job of
evaluating the business requirements and preparing for my
meetings with subject matter experts when I have spent some
time looking at the data from a starting point of blissful
ignorance and curiosity.

So not to be evasive, I can only say that we may be able to
answer some of your questions by the end of the series but we
may also not be able to answer some of them.

At the very least, we will have a strong working theory before
we engage the rest of the team.

Best Regards
Jim

Vish Agashe 7 years ago

Nice article Jim. I am looking forward to all five parts... how
often do you plan to post them?

In addition to the points made by Ira, some of the questions
which came to my mind were as follows (based on the stats
above):

1. What is the mode in which this organization
   communicates with/touches its customers? It looks like a
   significant number of customers are missing some information:
   not all have telephone numbers, not all have emails, not all
   have addresses etc... Is it important to touch/communicate
   with customers?
2. What are customer name 1 and customer name 2? Why are
   two names captured and how are they used?
3. I would definitely like to know the usage of the following
   fields. Since none of these fields have values 100% of the time
   but they have a high Actual count, it makes me feel that these
   fields were captured with some intent (and I want to know that
   intent):
   Birth Date
   Gender Code
   Telephone Number
   Email

Vish Agashe

Jim Harris 7 years ago

Thanks for your feedback Ira,

I do not think that anything you have suggested is excessive.

I completely agree with you that data profiling (especially with
large data sets) can be very expensive in terms of processing
time and therefore you want to get as much out of that
investment as you can. In fact, this is the very grunt work that
data profiling tools automate for you and is one of the primary
benefits for using one.

You can assume that all of the results shown throughout the
series have been produced by the initial processing of the data
source by the data profiling tool.

Some of the additional and extremely useful information that
you have described will in fact be included in subsequent parts
in the series. However, some of it will not be included only
because, as I explained in the overview, I will not be attempting
to cover every possible feature of a data profiling tool, mainly
because I don't want the series to be too long. Currently I am
planning on 5 parts.

Best Regards
Jim

Ira Warren Whiteside 7 years ago

Nice job, however I would like to suggest, in addition to basic
counts and percentages, and since you or the tool had to scan the
entire file to get your results, that you may also want to include a
few additional results of the scan, such as domain values (Min, Max,
Mode (most popular value)) and their respective counts, plus the
patterns (email ZZZ@ZZZ.ZZ, telephone 99999999999 etc...)
and column metadata (max length, min length, data type).

This may seem excessive, however for large data sets the scan is
expensive and you want to get as much out of it as you can.
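
The suggestion in the comment above can be sketched as a single pass
that collects several of those extra results together. This is a
minimal Python illustration, not the behavior of any specific tool. The
names to_pattern and scan_field are hypothetical, the Z and 9 pattern
symbols follow the examples in the comment, Min and Max use plain string
comparison, and data type inference is left out.

```python
from collections import Counter
from typing import Iterable, Optional


def to_pattern(value: str) -> str:
    """Letters become Z, digits become 9, everything else is kept."""
    return "".join(
        "Z" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in value
    )


def scan_field(values: Iterable[Optional[str]]) -> dict:
    """Collect domain values, lengths, and patterns in one pass."""
    value_counts: Counter = Counter()
    pattern_counts: Counter = Counter()
    for value in values:  # the only pass over the data
        if value is None or value.strip() == "":
            continue  # NULL and missing values are skipped
        value_counts[value] += 1
        pattern_counts[to_pattern(value)] += 1
    if not value_counts:
        return {}
    lengths = [len(v) for v in value_counts]
    return {
        "min": min(value_counts),  # lexicographic order, not numeric
        "max": max(value_counts),
        "mode": value_counts.most_common(1)[0],  # (value, count)
        "min_length": min(lengths),
        "max_length": max(lengths),
        "patterns": pattern_counts.most_common(),
    }


# Hypothetical Email values.
result = scan_field(["a@b.co", "a@b.co", "xy@z.org", None, " "])
```

Everything after the loop works on the distinct values rather than the
raw records, so the expensive scan over the data happens only once.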

Jim Harris 7 years ago

On Twitter, Henrik Liliendahl Sørensen commented:

ZIP Code and Tax ID are United States (US) specific field
labels

And I responded by asking:

Does that guarantee that the field values will be?

And Henrik tweeted back:

Absolutely not. P.S.: Even Omikron uses ZIP Code in the
English version :(

My final untweeted commentary:

Verifying data matches the metadata that describes it is one
essential analytical task that data profiling can help us with,
providing a much needed reality check for the perceptions and
assumptions that we may have about our data.
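
A small Python sketch of that reality check: compare the values in a
field against the format its label implies, and list the ones that do
not fit. The rule used for a US ZIP Code (five digits, optionally
followed by a hyphen and four more) is an assumption chosen for this
illustration, and the function name and sample values are hypothetical.

```python
import re
from typing import Iterable, List, Optional

# Assumption: a US ZIP Code is 5 digits, optionally followed by a
# hyphen and 4 more digits (ZIP+4).
US_ZIP = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def values_not_matching(
    values: Iterable[Optional[str]], pattern: re.Pattern
) -> List[str]:
    """Actual values that do not fit the format the field label implies."""
    return [
        v
        for v in values
        if v is not None
        and v.strip() != ""
        and not pattern.fullmatch(v.strip())
    ]


# Hypothetical ZIP Code values, including one that looks like a UK postcode.
zip_codes = ["02134", "90210-1234", "SW1A 1AA", None, "1234"]
mismatches = values_not_matching(zip_codes, US_ZIP)
```

A non-empty result does not prove the data is wrong. It may equally
mean that the label is misleading, for example when a field named ZIP
Code also holds non-US postal codes.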

2016, Jim Harris.