12/29/2016 Adventures in Data Profiling (Part 1) - OCDQ Blog

OCDQ Blog
Obsessive-Compulsive Data Quality by Jim Harris

Adventures in Data Profiling (Part 1)
August 03, 2009, Jim Harris

Obsessive-Compulsive Data Quality (OCDQ) is a blog offering a vendor-neutral perspective on data quality and its related disciplines.

In my popular post Getting Your Data Freq On, I explained that understanding your data is essential to using it effectively and improving its quality, and that to achieve these goals there is simply no substitute for data analysis.

I also explained the benefits of using a data profiling tool to help automate some of the grunt work, while stressing that you still need to perform the actual analysis and then prepare meaningful questions and reports to share with the rest of your team.

Series Overview

This post is the beginning of a vendor-neutral series on the methodology of data profiling.

In order to narrow the scope of the series, the scenario used will be that a customer data source for a new data quality initiative has been made available to an external consultant who has no prior knowledge of the data or its expected characteristics. Also, the business requirements have not yet been documented, and the subject matter experts are not currently available.

http://www.ocdqblog.com/home/adventuresindataprofilingpart1.html 1/10

The series will not attempt to cover every possible feature of a data profiling tool or even every possible use of the features that are covered. Both the data profiling tool and the data used throughout the series will be fictional. The screenshots have been customized to illustrate concepts and are not modeled after any particular data profiling tool.

The Adventures Begin...

The customer data source has been processed by a data profiling tool, which has provided the above counts and percentages that summarize the following field content characteristics:

NULL: count of the number of records with a NULL value
Missing: count of the number of records with a missing value (i.e., non-NULL absence of data, e.g., character spaces)
Actual: count of the number of records with an actual value (i.e., non-NULL and non-missing)
Completeness: percentage calculated as Actual divided by the total number of records
Cardinality: count of the number of distinct actual values
Uniqueness: percentage calculated as Cardinality divided by the total number of records
Distinctness: percentage calculated as Cardinality divided by Actual
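
As a rough illustration of how these statistics relate to one another, here is a minimal Python sketch. It is not modeled on any particular data profiling tool. The function name profile_field, the sample values, and the decision to treat empty or whitespace-only strings as "missing" are all assumptions made for this example.

```python
def profile_field(values):
    """Summarize one field using the definitions listed above.

    Assumptions for this sketch: None represents NULL, and a string that is
    empty or contains only whitespace counts as a missing value.
    """
    def is_missing(v):
        return isinstance(v, str) and v.strip() == ""

    total = len(values)
    null_count = sum(1 for v in values if v is None)
    missing_count = sum(1 for v in values if is_missing(v))
    actual_values = [v for v in values if v is not None and not is_missing(v)]
    actual = len(actual_values)
    cardinality = len(set(actual_values))

    def pct(numerator, denominator):
        # Avoid division by zero for empty fields.
        return round(100.0 * numerator / denominator, 2) if denominator else 0.0

    return {
        "total": total,
        "null": null_count,
        "missing": missing_count,
        "actual": actual,
        "completeness": pct(actual, total),
        "cardinality": cardinality,
        "uniqueness": pct(cardinality, total),
        "distinctness": pct(cardinality, actual),
    }


# Hypothetical sample values, not taken from the data source in this series.
sample_gender_codes = ["M", "F", None, " ", "M", "U", "F", ""]
print(profile_field(sample_gender_codes))
```

For the eight sample values, this gives 1 NULL, 2 missing, and 5 actual values, so Completeness is 62.5%. The 3 distinct actual values give a Cardinality of 3, a Uniqueness of 37.5% (3 of 8 records), and a Distinctness of 60% (3 of 5 actual values).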

Some initial questions based on your analysis of these statistical summaries might include the following:

1. Is Customer ID the primary key for this data source?
2. Is Customer Name 1 the primary name on the account? If so, why isn't it always populated?
3. Do the statistics for Account Number and/or Tax ID indicate the presence of potential duplicate records?
4. Why does the Gender Code field have 8 distinct values?
5. Do the 5 distinct values in Country Code indicate international postal addresses?

Please remember the series scenario: you are an external consultant with no prior knowledge of the data or its expected characteristics, who is performing this analysis without the aid of either business requirements or subject matter experts.

What other questions can you think of based on analyzing the statistical summaries provided by the data profiling tool?

In Part 2 of this series: We will continue the adventures by attempting to answer these questions (and more) by beginning our analysis of the frequency distributions of the unique values and formats found within the fields. Additionally, we will begin using drill-down analysis in order to perform a more detailed review of records of interest.

Related Posts

Adventures in Data Profiling (Part 2)
Adventures in Data Profiling (Part 3)
Adventures in Data Profiling (Part 4)
Adventures in Data Profiling (Part 5)
Adventures in Data Profiling (Part 6)
Adventures in Data Profiling (Part 7)
Getting Your Data Freq On

Comments (8)

Jim Harris 7 years ago

Thanks for the additional questions, Toby and Phil.

Good eye for detail, Toby. I agree that something potentially odd seems to be happening in our State Abbreviation field. An upcoming part in the series will be drilling down to those 72 distinct values and hopefully we will be able to answer all of your great questions. Stay tuned.

Great question, Phil. Front-end validation (or some other upstream process before data reaches wherever we happen to be in the data's chain of custody) that replaces invalid values with NULL is a common approach to cleansing data.

However, as Bill Inmon recently discussed in his BeyeNetwork article Some Perspectives on Quality, keeping incorrect data, even when it is known to be incorrect, often provides valuable clues as to how data needs to be corrected.

Therefore, even when substituting NULL for invalid values is the business rule that gets implemented, I advocate at least tracking what the invalid value was, both for auditing purposes and for sending an upstream request to prevent it from happening or otherwise resolve the issue before it becomes a downstream problem.

However, as Bill Inmon's article also points out, there are multiple perspectives on what the best approach is and, as with so many other data issues, there is no one right answer.

As for the series, within the context of the scenario, we wouldn't know, but I will break my own rules (just a little) and tell you that no pre-cleansing has occurred on any of the fields in our data source.

However, that doesn't mean that all is necessarily well with the Birth Date field :)

Best Regards

Jim

Phil A 7 years ago

Jim,

Thanks for the article, looking forward to the rest :)

My question would be about the treatment of the data BEFORE it gets to you.

For example, has Birth Date already been validated by the front-end system, thereby rendering invalid dates as NULL? Is that why the figure we see is not 100% complete and yet none are missing?

Toby 7 years ago

Additional questions:

Why are there 72 distinct values for State Abbreviation?

Does this indicate data quality problems? Non-US addresses? Misnamed column? Overloading?

What values are found in State Abbreviation?

Jim Harris 7 years ago

Thanks for your feedback and questions, Vish. Both are greatly appreciated.

I plan to post 2 parts per week. However, I might stretch it out a little on a few of the parts because I want to leave enough time between posts to allow for comments. This is one of the primary reasons that I am doing the series via a blog and not a whitepaper or a presentation. I feel that the dialogue and discussion greatly improve the content and allow some of my points to be made for me by those who take the time to comment, as opposed to me just telling everyone how I see the world.

All of your questions are excellent. I could answer them all right now, but it would defeat the goal of the series.

Remember the scenario: we are pretending to be an external consultant with no prior knowledge of the data or its expected characteristics. Obviously, this is far easier for you and the rest of my readers, since you truly don't have any knowledge of the data that I am using.

Many of your questions would need to be posed to those writing the business requirements or the subject matter experts for this data source. But again, in the series scenario, we do not have access to any of these resources.

Therefore, what the series is really asking is: What can just our analysis of the data tell us about it?

Data profiling at this stage of the process is often more about formulating a growing list of meaningful questions than about finding answers on our own. I would argue that this is not a waste of our time and, in fact, is a necessary aspect of performing comprehensive data analysis.

I have been on many projects where I was told that I should delay data profiling until business requirements and subject matter experts are available.

I always disagree because I think that I can do a better job of evaluating the business requirements and preparing for my meetings with subject matter experts when I have spent some time looking at the data from a starting point of blissful ignorance and curiosity.

So, not to be evasive, I can only say that we may be able to answer some of your questions by the end of the series, but we may also not be able to answer some of them.

At the very least, we will have a strong working theory before we engage the rest of the team.

Best Regards

Jim

Vish Agashe 7 years ago

Nice article, Jim. I am looking forward to all five parts... how often do you plan to post them?

In addition to the points made by Ira, some of the questions which came to my mind were as follows (based on the stats above):

1. What is the mode in which this organization communicates with/touches its customers? It looks like a significant number of customers are missing some information: not all have telephone numbers, not all have emails, not all have addresses, etc. Is it important to touch/communicate with customers?

2. What are Customer Name 1 and Customer Name 2? Why are there two names captured, and how are they used?

3. I would definitely like to know the usage of the following fields. Since none of these fields have values 100% of the time but they do have a high Actual count, it makes me feel that these fields were captured with some intent (and I want to know that intent):
Birth Date
Gender Code
Telephone Number
Email

Vish Agashe

Jim Harris 7 years ago

Thanks for your feedback, Ira.

I do not think that anything you have suggested is excessive.

I completely agree with you that data profiling (especially with large data sets) can be very expensive in terms of processing time, and therefore you want to get as much out of that investment as you can. In fact, this is the very grunt work that data profiling tools automate for you and is one of the primary benefits of using one.

You can assume that all of the results shown throughout the series have been produced by the initial processing of the data source by the data profiling tool.

Some of the additional and extremely useful information that you have described will in fact be included in subsequent parts in the series. However, some of it will not be included, only because, as I explained in the overview, I will not be attempting to cover every possible feature of a data profiling tool, mainly because I don't want the series to be too long. Currently I am planning on 5 parts.

Best Regards

Jim

Ira Warren Whiteside 7 years ago

Nice job. However, I would like to suggest that, in addition to basic counts and percentages, and since you or the tool had to scan the entire file to get your results, you may also want to include a few additional results of the scan, such as domain values (Min, Max, Mode (most popular value)) and their respective counts, plus the patterns (email ZZZ@ZZZ.ZZ, telephone 99999999999, etc.) and column metadata (max length, min length, data type).

This may seem excessive; however, for large data sets the scan is expensive and you want to get as much out of it as you can.
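
The extra scan results suggested in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names value_pattern and extended_profile are hypothetical, values are assumed to be strings (or None for NULL), letters are generalized to Z and digits to 9 following the notation in the comment, and the sample telephone numbers are made up.

```python
from collections import Counter


def value_pattern(value):
    """Generalize a value: letters become Z, digits become 9, the rest is kept."""
    return "".join(
        "Z" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in value
    )


def extended_profile(values):
    """Return min, max, mode, length range, and pattern counts for one field.

    Assumes string values; None (NULL) and blank strings are ignored.
    Returns None when the field has no actual values.
    """
    actual = [v for v in values if v is not None and v.strip() != ""]
    if not actual:
        return None
    value_counts = Counter(actual)
    pattern_counts = Counter(value_pattern(v) for v in actual)
    lengths = [len(v) for v in actual]
    return {
        "min": min(actual),
        "max": max(actual),
        "mode": value_counts.most_common(1)[0],   # (value, count)
        "min_length": min(lengths),
        "max_length": max(lengths),
        "patterns": pattern_counts.most_common(),  # [(pattern, count), ...]
    }


# Hypothetical sample values, not taken from the data source in this series.
sample_phones = ["555-1234", "555-9876", None, " ", "5551234", "555-1234"]
print(extended_profile(sample_phones))
```

This sketch makes several passes over an in-memory list for clarity. A real profiling tool would more likely gather all of these results during a single scan, which is the point the comment makes about large data sets.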

Jim Harris 7 years ago

On Twitter, Henrik Liliendahl Sørensen commented:

ZIP Code and Tax ID are United States (US) specific field labels

And I responded by asking:

Does that guarantee that the field values will be?

And Henrik tweeted back:

Absolutely not. P.S.: Even Omikron uses ZIP Code in the English version :(

My final un-tweeted commentary:

Verifying that data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much-needed reality check for the perceptions and assumptions that we may have about our data.
