
Talk – Data Quality Framework

Matthew Lawler

11 Nov 2019 Matthew Lawler lawlermj1@gmail.com 1


Data Quality Framework Walkthrough

• Data Quality Overview


• Data Quality Defined
• Data Quality Management
• Data Quality Flow
• Data Quality Measures
11 Nov 2019 Data Quality Framework - Matthew Lawler 2
DQ Overview: Addressing DQ
Quality data means data that meets stated
requirements. So, for any piece of data, we will
ask three questions:
1. Does the data support Agency outcomes?
2. How well does the data support these Agency outcomes?
3. Can the data support these Agency outcomes at scale?

11 Nov 2019 Data Quality Framework - Matthew Lawler 3


DQ Overview: Fixing Data Rot
DQ issues will be tackled one at a time, using
Kent Beck's three-stage rule:
‘First make it work,
then make it right,
then make it fast.’

11 Nov 2019 Data Quality Framework - Matthew Lawler 4


DQ Overview: States and Users
[Diagram: Supplier data passes two gates, "Does data meet Source Requirements?" and "Does data meet Target Requirements?", moving through the input states (Alluvium, Blurry, Clear) into the output states (Bronze, Silver, Gold), with Errors routed to the Data Steward and the Explorer, Farmer and Tourist consuming the output layers.]

• The input states are: Alluvium, Blurry or Clear. (Filtering)
• The output states are: Bronze, Silver or Gold. (Transforms)
• A data explorer is a pattern discoverer.
• A data farmer is an information harvester.
• A data tourist is an information browser.
• A data steward manages data assets.

11 Nov 2019 Data Quality Framework - Matthew Lawler 5


Data Quality Framework Walkthrough

• Data Quality Overview


• Data Quality Defined
• Data Quality Management
• Data Quality Flow
• Data Quality Measures
11 Nov 2019 Data Quality Framework - Matthew Lawler 6
DQ : What is Data Quality?
• Quality is the degree to which a set of
inherent characteristics fulfils requirements
(ISO 9000).
• That is, Quality data means data that meets
stated requirements.
• The data must be ‘fit for purpose’. Nothing
more and nothing less.

11 Nov 2019 Data Quality Framework - Matthew Lawler 7


How does DQ help Business Value?

1. Increases business value; that is, the ability of the business to meet its goals.
2. Reduces business risks; that is, the chance that the business will not meet its goals.
3. Improves business productivity; that is, the efficiency with which the business achieves its goals.
4. Enhances business reputation; that is, how well the business influences external parties.
11 Nov 2019
Data Quality Framework - Matthew Lawler 8
What is Data Rot?
Data that does not satisfy requirements has, by
definition, become disordered and allowed to collapse
into a state of decay or 'data rot'.
1. Disordered data produces accidental complexity
which negates data value.
2. Disordered data also increases data risks, such as
incomplete decision making based on gut feel rather
than data.
3. Finally, disordered data cannot be optimised, by
definition.
Data Rot will be addressed by the 'First make it work,
then make it right, then make it fast' 3 stage approach.

11 Nov 2019 Data Quality Framework - Matthew Lawler 9


Poor DQ Examples
Data Rot Cause: Incomplete record keeping and limited indexing
  Data Value: Finding records often depends on finding the person responsible. But what happens when the person leaves, or there is a re-organisation?
  Data Risk: Inability to find papers that are the basis of policy decisions; single points of failure; repeated work; poor decision making on incomplete data. Search costs are high.
  Data Efficiency: Bottleneck on individuals.

Data Rot Cause: Independent ETL
  Data Value: Data is aggregated from sources by each division separately, using different approaches and tools.
  Data Risk: Data quality is uneven because different testing is performed. There are also missed opportunities to share experiences and costs.
  Data Efficiency: There is some efficiency already, as the divisions have discovered ways to solve this problem independently, but it could be improved. Reporting staff are busier with data cleansing than with data reporting.

Data Rot Cause: Integration is fundamentally difficult
  Data Value: There is little common ground between Environmental, Economic and Sociological data. The best integration data type is geospatial data, which is complex. Integration is done per study.
  Data Risk: Missed relationships between Environmental, Economic and Sociological goals, so that the Agency has difficulty in achieving the triple bottom line.
  Data Efficiency: Cannot be scaled, as rules are manual and defined in multiple places.

Data Rot Cause: No common platform
  Data Value: Each developer uses MS Access as the default.
  Data Risk: Rapid obsolescence means data becomes inaccessible.
  Data Efficiency: Unable to scale due to the 2 GB limit.

11 Nov 2019 Data Quality Framework - Matthew Lawler 10


Data Quality Framework Walkthrough

• Data Quality Overview


• Data Quality Defined
• Data Quality Management
• Data Quality Flow
• Data Quality Measures
11 Nov 2019 Data Quality Framework - Matthew Lawler 11
DQ Mgt: Goals
1. Develop a governed approach to make data fit for purpose, based on requirements.
2. Define standards, requirements and specifications for data quality.
3. Define and implement processes to measure, monitor and report on data quality (a minimal sketch of such a measurement follows this slide).
4. Continuously improve data quality through process and system improvements.

11 Nov 2019 Data Quality Framework - Matthew Lawler 12
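To make goal 3 concrete, here is a minimal Python sketch of measuring one quality level against an agreed threshold and flagging it for escalation; the measure name, the column and the 95% threshold are illustrative assumptions, not part of the framework itself.

def completeness(values):
    """Proportion of populated (non-null, non-empty) values in a column."""
    if not values:
        return 0.0
    populated = sum(1 for v in values if v is not None and v != "")
    return populated / len(values)

def check_against_threshold(measure_name, level, threshold):
    """Return a monitoring record; escalate when the threshold is breached."""
    return {
        "measure": measure_name,
        "level": round(level, 3),
        "threshold": threshold,
        "escalate_to_data_steward": level < threshold,
    }

# Example usage with a hypothetical column and a 95% completeness threshold.
postcodes = ["2600", "2601", None, "2603", ""]
print(check_against_threshold("postcode completeness", completeness(postcodes), 0.95))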


DQ Mgt: Plan-Deploy-Monitor-Act Cycle

[Diagram: the Plan-Deploy-Monitor-Act (PDMA) cycle, showing the roles involved around the cycle: Data Explorer, Data Steward, Data Management Working Group, Data Owner, Chief Data Officer, Data Farmer and Data Tourist.]
11 Nov 2019 Data Quality Framework - Matthew Lawler 13


DQ Mgt: Phases
Plan (Focus: Governance)
  Actors: DMWG; Chief Data Office; Divisions Management; Data Owner; Data Stewards.
  Purpose: Assess issues, evaluate alternatives and determine cost and impact of change; plan to take on new data collections; initiate and review upgrades of current data collections.
  Outcomes: Defined data quality requirements, infrastructure, standards and thresholds; data quality assessment; Data Quality Improvement Plan.
  Key Steps: Define data quality requirements, infrastructure, standards and thresholds; assess current data quality; create data quality improvement plan; set up data quality PDMA process.

Deploy (Focus: Governance)
  Actors: DMWG; Chief Data Office; Divisions Management; Data Stewards; Divisions Staff.
  Purpose: Implement the Data Quality Plan to establish a business-as-usual capability to achieve acceptable data quality.
  Outcomes: Agreed DQ improvement plan; rollout/training to Divisional staff; data quality embedded into operations.
  Key Steps: Execute the plan; publish standards; implement infrastructure; train staff; incorporate data quality into normal operations.

Monitor (Focus: Operations)
  Actors: Divisions Staff; Data Stewards (for guidance and advice); Chief Data Office (for escalation).
  Purpose: Active monitoring against defined thresholds, so that data quality meets business requirements.
  Outcomes: Known data quality levels measured against thresholds; escalation process to Data Stewards.
  Key Steps: Measure data quality levels against defined business rules; escalate to Data Stewards when needed; trend analysis on key metrics.

Act (Focus: Operations)
  Actors: Divisions Staff; Data Stewards (for guidance and advice); Chief Data Office (for escalation).
  Purpose: Use tools and approaches to fix data quality issues once thresholds have been exceeded.
  Outcomes: More data quality measures remain within agreed business rules, leading to increased data quality; reiterate the PDMA cycle.
  Key Steps: Use agreed methods and tools to fix data quality issues; do analysis and cleansing; negotiate with stakeholders regarding quality levels; enforce the use of approved tools; review agreements and IT as needed.

11 Nov 2019 Data Quality Framework - Matthew Lawler 14


DQ Mgt: Next Steps
1. Define data quality requirements,
infrastructure, standards and thresholds
2. Assess current data quality
3. Create data quality improvement plan
4. Set up data quality PDMA process

11 Nov 2019 Data Quality Framework - Matthew Lawler 15


Data Quality Framework Walkthrough

• Data Quality Overview


• Data Quality Defined
• Data Quality Management
• Data Quality Flow
• Data Quality Measures
11 Nov 2019 Data Quality Framework - Matthew Lawler 16
DQ Flow: Definitions
Data Explorer: A data explorer is an information pattern discoverer. They are like an automotive engineer that designs cars.

Data Farmer: A data farmer is an information harvester. They are like a mechanic.

Data Quality: Quality is the degree to which a set of inherent characteristics fulfils requirements (ISO 9000).

Data Rot: The tendency of data to undergo decomposition or decay, or to fall into chaos or disorder.

Data Tourist: A data tourist is an information browser. They are like a vehicle driver.

Enhancement: Any transformation to the data to make the data fit defined requirements, separate from any intrinsic or extrinsic checking.

Data Owner: A data owner has de facto title over a dataset.

Data Steward: A data steward manages data assets on behalf of others and in the best interests of the organization.

Extrinsic checking: Any filter applied to the data using an extrinsic predicate.

Extrinsic data quality characteristic: A data characteristic that can be evaluated by looking at the data alongside some external standard. That is, it is the combination of an external definition and the data itself.

Intrinsic checking: Any filter applied to the data using an intrinsic predicate.

Intrinsic data quality characteristic: A data characteristic that can be evaluated by looking at the data itself. That is, it is internal and implied by the data (type) definition.

11 Nov 2019 Data Quality Framework - Matthew Lawler 17
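To make the intrinsic/extrinsic distinction concrete, a minimal Python sketch: an intrinsic predicate can be evaluated from the value alone, while an extrinsic predicate needs an external standard. The date format and the state-code reference list are assumptions for illustration only.

from datetime import datetime

# Intrinsic predicate: decidable from the data (type) definition alone.
def is_valid_date(value):
    """True when the value parses as an ISO date (yyyy-mm-dd)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

# Extrinsic predicate: needs an external standard (here, a reference code list).
VALID_STATE_CODES = {"ACT", "NSW", "NT", "QLD", "SA", "TAS", "VIC", "WA"}

def is_known_state(value):
    return value in VALID_STATE_CODES

record = {"received": "2019-11-11", "state": "ACT"}
intrinsically_ok = is_valid_date(record["received"])   # intrinsic checking
extrinsically_ok = is_known_state(record["state"])     # extrinsic checking
print(intrinsically_ok, extrinsically_ok)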


DQ Flow: 2 Kinds of Data Quality
Kind                    Input side                  Output side
DQ States               Alluvium, Blurry or Clear   Bronze, Silver or Gold
Requirements From       Source Provider             Business
Direction               Input                       Output
Function                Filter                      Transform
Higher Order Function   Map                         Reduce
11 Nov 2019 Data Quality Framework - Matthew Lawler 18
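Read as higher-order functions, the two kinds line up with the familiar filter/map/reduce trio. A minimal Python sketch with invented records: input-side filtering against source requirements, then output-side transformation and reduction for business requirements.

from functools import reduce

records = [
    {"site": "A", "flow": 12.5},
    {"site": "B", "flow": None},   # fails the source requirement
    {"site": "C", "flow": 7.0},
]

# Input side: filters decide which records meet source requirements.
clear = list(filter(lambda r: r["flow"] is not None, records))

# Output side: transforms (map) and reductions (reduce) meet business requirements.
litres = list(map(lambda r: {**r, "flow_litres": r["flow"] * 1000}, clear))
total_flow = reduce(lambda acc, r: acc + r["flow"], clear, 0.0)

print(litres)
print(total_flow)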
DQ Flow: High Level Flow
[Diagram: the same state flow as the earlier States and Users slide, with the "Does data meet Source Requirements?" and "Does data meet Target Requirements?" gates, the Alluvium, Blurry and Clear input states, the Bronze, Silver and Gold output states, Errors routed to the Data Steward, and the Explorer, Farmer and Tourist consuming the output layers.]

The data state moves from A to B to C; that is, from Alluvium (aka mud) to Blurry (cloudy) to Clear (pure).

This filtered and completed, but untransformed and unenhanced, Clear data is then good enough to be used for transformation, reduction and reporting.

For the Data Explorer, the Bronze layer provides keys that can be joined to integrate data across different source systems.

For the Data Farmer, the Silver layer allows their SQL skills to be used to prepare customised reports.

For the Data Tourist, the Kimball Gold layer allows them to use a BI tool to explore business questions, and provides easy-to-use access for management.

11 Nov 2019 Data Quality Framework - Matthew Lawler 19
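A minimal Python sketch of the same flow as a pipeline, using stand-in functions for each step; the layer names follow the slide, while the record fields, codes and key format are illustrative assumptions.

def intrinsic_filter(rows):
    """Alluvium -> Blurry: keep rows that conform to types (intrinsic checks)."""
    return [r for r in rows if isinstance(r.get("amount"), (int, float))]

def extrinsic_filter(rows, valid_codes):
    """Blurry -> Clear: keep rows whose code exists in the reference list."""
    return [r for r in rows if r.get("code") in valid_codes]

def hub_transform(rows):
    """Clear -> Bronze: derive join keys so sources can be integrated."""
    return [{**r, "hub_key": f"{r['code']}|{r['source']}"} for r in rows]

alluvium = [
    {"source": "sys1", "code": "X1", "amount": 10.0},
    {"source": "sys1", "code": "??", "amount": "n/a"},   # dropped by intrinsic check
]
blurry = intrinsic_filter(alluvium)
clear = extrinsic_filter(blurry, valid_codes={"X1", "X2"})
bronze = hub_transform(clear)
print(bronze)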


DQ Flow: Data Quality States

1. Sandpit (Input). Role: Explorer. Usable by: Scientist. Skills: Insight.
   Data Process: Initial analysis for a Go, On or No decision. Simple POC.
   Outcome: Business case. Schema: Undefined.

2. Lead (Output). Role: Explorer. Usable by: Scientist. Skills: Insight.
   Data Process: Manual steps; little automation.
   Outcome: Defined requirements, filtering and enhancements. Schema: Defined.

3. Alluvium, aka Night, Mud (Input). Role: Builder. Usable by: Developer. Skills: Language.
   Data Process: Record of receipt and acceptance. No filtering.
   Outcome: DQ assessment; MOU and SLA checking. Schema: Source schema.

4. Blurry, aka Cloudy (Input). Role: Builder. Usable by: Developer. Skills: Language.
   Data Process: Intrinsic filtering.
   Outcome: Data conforms to types and specification. Schema: Source schema.

5. Clear, aka Sunny (Input). Role: Builder. Usable by: Developer. Skills: Language.
   Data Process: Extrinsic filtering.
   Outcome: Data can be joined and integrated. Schema: Validated source schema.

6. Bronze (Output). Role: Explorer. Usable by: Developer. Skills: Language.
   Data Process: Hub transforms.
   Outcome: Hubs and links defined; keys can be matched across sources. Schema: Data Vault Stage 1.

7. Silver (Output). Role: Farmer. Usable by: Analyst. Skills: SQL.
   Data Process: Satellite transforms.
   Outcome: Satellites defined. Schema: Data Vault Stage 2.

8. Gold (Output). Role: Tourist. Usable by: Management. Skills: BI tool user.
   Data Process: Facts/Dimension transforms.
   Outcome: Facts and Dimensions defined. Schema: Kimball.
11 Nov 2019 Data Quality Framework - Matthew Lawler 20
DQ Flow: Level 1 Flow

[Diagram: Supplier data enters as Alluvium, passes Intrinsic Checks to become Blurry and Extrinsic Checks to become Clear, with Hard and Soft Errors routed to the Data Steward. Clear is transformed via Data Vault/Inmon Hubs and Links into Bronze (Explorer), via Data Vault/Inmon Satellites into Silver (Farmer), and via Kimball Facts and Dimensions into Gold (Tourist). Sandpit leads to Lead through Manual Discovery. ISO standard dimensions noted: Completeness, Referential Integrity, Security, Time.]

• This fills in the processes between states.
• B/W processes are filters.
• Colour processes are transforms.
• Sandpit is manual discovery.

11 Nov 2019 Data Quality Framework - Matthew Lawler 21


DQ Flow: Maturity Decision Tree
Predicate Rules (conditions)                               Classify data as:
                                                           Sandpit  Lead  Alluvium  Blurry  Clear  Bronze  Silver  Gold
Are the data rules documented?                             No       Yes   Yes       Yes     Yes    Yes     Yes     Yes
Is the data load automated?                                No       No    Yes       Yes     Yes    Yes     Yes     Yes
Are Intrinsic (data type) checks used to filter the data?  No       No    No        Yes     Yes    Yes     Yes     Yes
Are Extrinsic (lookup) checks used to filter the data?     No       No    No        No      Yes    Yes     Yes     Yes
Has the data been transformed into Hubs and Links?         No       No    No        No      No     Yes     Yes     Yes
Has the data been transformed into Satellites?             No       No    No        No      No     No      Yes     Yes
Has the data been transformed into Facts and Dimensions?   No       No    No        No      No     No      No      Yes
11 Nov 2019 Data Quality Framework - Matthew Lawler 22
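A minimal Python sketch of the decision tree as code, assuming each predicate has already been answered True or False for a data collection; the question order and resulting states follow the table above.

def classify_maturity(rules_documented, load_automated, intrinsic_checks,
                      extrinsic_checks, hubs_and_links, satellites, facts_and_dims):
    """Walk the predicate rules top to bottom and return the data quality state."""
    if not rules_documented:
        return "Sandpit"
    if not load_automated:
        return "Lead"
    if not intrinsic_checks:
        return "Alluvium"
    if not extrinsic_checks:
        return "Blurry"
    if not hubs_and_links:
        return "Clear"
    if not satellites:
        return "Bronze"
    if not facts_and_dims:
        return "Silver"
    return "Gold"

# Example: rules documented and load automated, intrinsic checks in place,
# but no extrinsic checks yet -> the collection classifies as Blurry.
print(classify_maturity(True, True, True, False, False, False, False))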
DQ Flow: Maturity Conditional
[Flowchart: the same classification as the decision tree, expressed as a chain of Yes/No questions.]

• Are the Requirements for the data documented? No: Sandpit. Yes: continue.
• Is the data acquisition largely manual? Yes: Lead. No: continue.
• Are Intrinsic (data type) checks used to filter the data? No: Alluvium. Yes: continue.
• Are Extrinsic (lookup) checks used to filter the data? No: Blurry. Yes: continue.
• Has the data been transformed into hubs and links? No: Clear. Yes: continue.
• Has the data been transformed into satellites? No: Bronze. Yes: continue.
• Has the data been transformed into Facts and Dimensions? No: Silver. Yes: Gold.

11 Nov 2019 Data Quality Framework - Matthew Lawler 23


Data Quality Framework Walkthrough

• Data Quality Overview


• Data Quality Defined
• Data Quality Management
• Data Quality Flow
• Data Quality Measures
11 Nov 2019 Data Quality Framework - Matthew Lawler 24
DQ Measures: Blurry
(Extra Argument: None for all dimensions on this slide.)

Check Digit: If a checksum digit is provided on a field, then apply the checksum rule to ensure the value is valid. Example: each value passes the checksum test.

Completeness: The proportion of stored data against the potential of “100% complete”. Business rules define what “100% complete” represents. Example: all mandatory (not null) columns are populated.

Control Total: If a control total is provided, such as a balance sum, then sum this column across all records. Example: SUM(Balance) = header total balance.

Location: Locations should all be in the Agency’s area of responsibility. Example: check that all lat/longs are in the boundary.

Precision: Precision refers to the level of detail of the data element. Example: an exchange rate is known to the 4th decimal place.

Reasonableness - Intrinsic: Use reasonableness to consider consistency expectations relevant within specific operational contexts. These are defined at the individual column or type value level. Example: ranges on values, lat/long within Australia, no negative flows, etc.

Record Count: If a record count is provided in the header, check that all records were received. Example: COUNT(*) = header count.

Standards Fit - Intrinsic: How well does the type fit an XML schema standard? When an incoming message conforms to a standard, this can be considered the intrinsic standard. In other cases, the standard may be more loosely applied. Example: does the data fit the XML messaging standard?

Uniqueness: Essentially, uniqueness states that no entity exists more than once within the data set. Example: for the primary key columns, ensure that there are no duplicate primary keys.

Validity: Validity refers to whether data instances are stored, exchanged, or presented in a format that is consistent with the domain values, as well as consistent with other similar attribute values. Example: all codes are valid values; all dates are in the required format, not string; no alphas in number fields.
11 Nov 2019 Data Quality Framework - Matthew Lawler 25
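A minimal Python sketch of three of these intrinsic measures (Completeness, Uniqueness, and Record Count/Control Total), assuming the data arrives as a list of records plus a simple header record; the column names are illustrative.

rows = [
    {"id": 1, "balance": 100.0},
    {"id": 2, "balance": None},
    {"id": 2, "balance": 50.0},   # duplicate primary key
]
header = {"record_count": 3, "balance_total": 150.0}

# Completeness: proportion of populated values in a mandatory column.
balances = [r["balance"] for r in rows]
completeness = sum(v is not None for v in balances) / len(balances)

# Uniqueness: no primary key value should occur more than once.
ids = [r["id"] for r in rows]
unique = len(ids) == len(set(ids))

# Record Count and Control Total: compare against the header record.
count_ok = len(rows) == header["record_count"]
total_ok = sum(v for v in balances if v is not None) == header["balance_total"]

print(completeness, unique, count_ok, total_ok)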
DQ Measures: Clear

Accuracy (Argument: Functional Dependency): The degree to which data correctly describes the “real world” object or event being described. In many cases, accuracy is measured by how well the values agree with an identified reference source of correct information. Example: the identity of a person is known to an 80% probability.

Adequacy Check (Argument: Functional Dependency): Is there enough information in the record so that the record can be enhanced using some reference data? Example: is there enough address info for a geo location?

Consistency (Argument: Another value): Consistency refers to ensuring data values in one data set are consistent with values in another data set. The concept of consistency is relatively broad; it can include an expectation that two data values drawn from separate data sets must not conflict with each other, or define consistency with a set of predefined constraints. Example: set union/intersection beyond RI and code values.

Privacy (Argument: Privacy Policy): Privacy refers to the level of need for access control and usage monitoring. Some data elements require limits on usage or access. Example: separation of data into secure areas, masking, or other approaches can be used.

Referential Integrity (Argument: Other Table): Referential integrity is the condition that exists when all intended references from data in one column of a table to data in another column of the same or a different table are valid. Example: no foreign key values that are not in the parent table.

Standards Fit - Extrinsic (Argument: Standard): How well does the type fit an external type standard? The rules may be more or less strict than the intrinsic type rules. Example: does the data fit the agreed XML standard?

Time Variance / Currency (Argument: Time): The degree to which information is current with the world that it models. Data currency refers to how “fresh” the data is, as well as correctness in the face of possible time-related changes. Example: ensure that timestamping has occurred, and that the data is captured in a time-variant manner.

Timeliness (Argument: Time): Timeliness refers to the time expectation for accessibility and availability of information. Example: ensure that time gaps are identified and followed up.
11 Nov 2019 Data Quality Framework - Matthew Lawler 26
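A minimal Python sketch of the Referential Integrity measure, assuming a child table with a foreign key column and a parent reference table; the table and column names are illustrative.

parent_codes = {"C01", "C02", "C03"}          # parent (reference) table keys
child_rows = [
    {"txn": 1, "code": "C01"},
    {"txn": 2, "code": "C09"},                # orphan: no matching parent key
]

# Referential integrity: every foreign key value must exist in the parent table.
orphans = [r for r in child_rows if r["code"] not in parent_codes]
print("RI violations:", orphans)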
DQ Measures: Bronze

Build new (Extra Argument: Keys): Use available record information and reference data to create new identifiers. Example: use address data to derive the Lat/Long.

Data Vault (Extra Argument: Hubs and Links): Hub and Link data structures are needed to enable joins across different source schemas. Example: see the Data Vault standard.

RDF (Extra Argument: Reification): Data structures may need to be transformed to make them usable. Example: if data is structured in RDF triples, then these may need to be reified into relational form before reporting can be done.
11 Nov 2019 Data Quality Framework - Matthew Lawler 27
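A minimal Python sketch of the 'Build new Keys' and Data Vault Hub ideas: deriving a deterministic key from a business key so records from different sources can be matched. Hashing the business key is a common Data Vault convention, but the exact key design here is an assumption, not the framework's prescription.

import hashlib

def hub_key(business_key: str) -> str:
    """Deterministic surrogate key derived from a normalised business key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

# Two sources describing the same site now share a hub key and can be joined.
source_a = {"site_code": "stn-0042", "flow": 12.5}
source_b = {"station": "STN-0042", "owner": "Agency"}

print(hub_key(source_a["site_code"]) == hub_key(source_b["station"]))  # True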
DQ Measures: Silver

Data Vault (Extra Argument: Satellites): Satellite data structures are needed to easily compare dependent data in the Data Vault model. Example: see the Data Vault standard.

Subject (Extra Argument: Orientation): Data structures may need to be transformed to make them subject oriented. Example: when combining data from different sources, some standardisation to a common subject orientation may be needed.
11 Nov 2019 Data Quality Framework - Matthew Lawler 28
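A minimal Python sketch of a satellite-style record hanging off a hub key: descriptive attributes kept separate from the key and time-stamped, so dependent data can be compared across loads. The columns are illustrative, not a full Data Vault implementation.

from datetime import date

def satellite_row(hub_key: str, attributes: dict, load_date: date) -> dict:
    """Descriptive attributes for a hub key, stamped with the load date."""
    return {"hub_key": hub_key, "load_date": load_date.isoformat(), **attributes}

sat_v1 = satellite_row("abc123", {"status": "active", "owner": "Division A"}, date(2019, 10, 1))
sat_v2 = satellite_row("abc123", {"status": "closed", "owner": "Division A"}, date(2019, 11, 1))

# Comparing dependent data across load dates is what the satellite structure enables.
changed = {k for k in sat_v1 if k != "load_date" and sat_v1[k] != sat_v2[k]}
print(changed)  # {'status'}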
DQ Measures: Gold

Facts and Dimensions: Fact and Dimension data structures are needed for easy Business Intelligence by the business end user. Example: see the Kimball standard.

11 Nov 2019 Data Quality Framework - Matthew Lawler 29
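A minimal Python sketch of the Kimball fact/dimension idea: a fact table of measures keyed to a dimension table, aggregated the way a BI tool would. Both tables are invented for illustration.

# Dimension: descriptive attributes keyed by a surrogate key.
dim_site = {1: {"site": "Upper Creek", "region": "North"},
            2: {"site": "Lower Creek", "region": "South"}}

# Fact: numeric measures plus dimension keys.
fact_flow = [
    {"site_key": 1, "flow": 12.5},
    {"site_key": 1, "flow": 9.0},
    {"site_key": 2, "flow": 7.0},
]

# A typical gold-layer question: total flow by region.
totals = {}
for row in fact_flow:
    region = dim_site[row["site_key"]]["region"]
    totals[region] = totals.get(region, 0.0) + row["flow"]
print(totals)  # {'North': 21.5, 'South': 7.0}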


Data Quality Framework Walkthrough

• Data Quality Overview


• Data Quality Defined
• Data Quality Management
• Data Quality Flow
• Data Quality Measures
11 Nov 2019 Data Quality Framework - Matthew Lawler 30
