GIS Data Quality

Producing better data quality through robust business processes


Kim Ollivier

BrightStar
TRAINING

Schedule Day One


Suggested times:
Start: 9:00
Session 1 (90 min)
Morning tea: 10:30 to 10:45
Session 2 (105 min)
Lunch: 12:30 to 1:30
Session 3 (90 min)
Afternoon tea: 3:00 to 3:15
Session 4 (105 min)
Finish: 5:00
Each session will have an exercise or interactive discussion.

Today

Introduction
What causes poor quality

Lunch
Assessing Quality processes
GIS upgrade project examples

Tomorrow

Metadata
Designing rules

Lunch
Data warehouse and ETL
Feature maintenance

Overview

Introduce yourself
Your goals for this course?

Build a data quality system
Avoid the worst traps
Be able to describe a project scope
Budget, timeline, priorities

Sections of the course are based on Data Quality Assessment by Arkady Maydanchik, used with permission from the author

ISBN 978-0-09771400-2

What is Data Quality?


Data are of high quality if they are fit for their intended uses in operations, decision making and planning.

Alternatively, data are of high quality if they correctly represent the real-world construct to which they refer.

Spatial Accuracy

Statistical Accuracy

Completeness Score = Relevant / (Relevant + Missing)
Accuracy Score = (Relevant - Errors) / Relevant
Overall Score = (Relevant - Errors) / (Relevant + Missing)
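The three scores can be computed directly from record counts. The sketch below is illustrative only: the function name and the reading of the counts (Relevant = records present that belong in the dataset, Missing = records that should be present but are not, Errors = relevant records with at least one error) are assumptions, not definitions taken from the slides.

```python
def quality_scores(relevant: int, missing: int, errors: int) -> dict:
    """Return completeness, accuracy and overall scores as fractions (0 to 1).

    Assumed meanings (not stated on the slide):
      relevant: records present that belong in the dataset
      missing:  records that should be present but are not
      errors:   relevant records containing at least one error
    """
    if relevant <= 0:
        raise ValueError("relevant must be positive")
    completeness = relevant / (relevant + missing)
    accuracy = (relevant - errors) / relevant
    overall = (relevant - errors) / (relevant + missing)
    return {
        "completeness": completeness,
        "accuracy": accuracy,
        "overall": overall,
    }


# Hypothetical counts for illustration
scores = quality_scores(relevant=900, missing=100, errors=90)
```

With these definitions the overall score is the product of the completeness and accuracy scores, so a dataset that is 90% complete and 90% accurate scores 81% overall.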

Completeness

LINZ Bulk Data Extract metadata\meta.html

Data Profiling

Find out what is there
Assess the risks
Understand data challenges early
Have an enterprise view of all data

Profile Metrics

Integrity
Consistency
Completeness, Density
Validity
Timeliness
Accessibility
Uniqueness

Security

Confidentiality
Possession
Integrity
Authenticity
Availability
Utility

Consistency

Discrepancies between attributes
Exceptions in a cluster
Spatial discrepancies
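One way to look for exceptions in a cluster is to group records by one attribute and flag those whose second attribute disagrees with the majority of the group. This is a minimal sketch under that assumption; the function name, field names and sample parcels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_exceptions(records, key, attr):
    """Flag records whose `attr` differs from the most common value in their `key` group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    exceptions = []
    for members in groups.values():
        # Majority value of attr within this cluster
        majority, _count = Counter(m[attr] for m in members).most_common(1)[0]
        exceptions.extend(m for m in members if m[attr] != majority)
    return exceptions


# Hypothetical sample data
parcels = [
    {"id": 1, "street": "Queen St", "suburb": "Central"},
    {"id": 2, "street": "Queen St", "suburb": "Central"},
    {"id": 3, "street": "Queen St", "suburb": "Northside"},
    {"id": 4, "street": "High St", "suburb": "Central"},
]
odd = cluster_exceptions(parcels, key="street", attr="suburb")
```

Flagged records are candidates for review rather than confirmed errors: a long street can legitimately cross a suburb boundary.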

A GIS Data Quality System

Assess: Data Quality Assessment, Data Profiling
Improve: Data Cleaning
Prevent: Monitoring Data Integration Interfaces; Ensuring Quality of Data Conversion and Consolidation
Recognise: Building Data Quality Metadata Warehouse
Monitor: Recurrent Data Quality Assessment

Course examples

LINZ coordinate upgrade 1998-2003
NSCC services upgrade 2008
Valuation roll structure and matching
ETL of utilities from SDE to AutoCAD
Address location issues: NAR, DRA
Documents and examples on memory stick

Exercise 1: Nominate your database


Select a representative example dataset for later discussion
You may be responsible for it
Or, you have to integrate it
Or, you have to load it
Or, you supply it to others

Morning Tea

Assessing Quality
1. Project steps
2. Required roles
3. Defining the objectives
4. Designing rules
5. Scorecard and Metadata
6. Frequency of assessment
7. Common mistakes

Processes Affecting Data Quality


Three groups of processes act on the database:

Processes bringing data from outside
  Initial Data Conversion
  System Consolidations
  Manual Data Entry
  Batch Feeds
  Real-Time Interfaces

Processes causing data decay
  Changes not captured
  System Upgrades
  New Data Uses
  Loss of Expertise
  Process Automation

Processes changing data from within
  Data processing
  Data cleaning
  Data purging

Outside: Initial Data Conversion

Define data mapping
Extract, Transform, Load (ETL)
Drown in Data Problems
Find Scapegoat

Outside: System Consolidation

Often from mergers (Auckland?)
Unplanned, unreasonable timeframes
Head-on two car wreck
Square pegs into round holes
Winner-loser merging (50% wrong)

Outside: Manual Data Entry

High error rate
Complex and poor entry forms
Users find ways around checks
Forcing non-blank fields does not work
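When a form forces a field to be non-blank, users tend to type filler instead. A profile can measure how often that happens. The sketch below assumes a small, hypothetical list of filler values; a real list would be built by looking at frequency reports for the field in question.

```python
# Hypothetical filler values users type to get past a "required" check
PLACEHOLDERS = {"", ".", "-", "n/a", "na", "none", "unknown", "x", "xxx", "tbc"}


def is_placeholder(value) -> bool:
    """True if the value is missing or looks like filler rather than data."""
    if value is None:
        return True
    return str(value).strip().lower() in PLACEHOLDERS


def placeholder_rate(values) -> float:
    """Fraction of values that are blank or filler."""
    values = list(values)
    if not values:
        return 0.0
    return sum(is_placeholder(v) for v in values) / len(values)


# Hypothetical owner-name column
owners = ["J Smith", ".", "N/A", "K Brown", "  "]
rate = placeholder_rate(owners)
```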

Outside: Batch Feeds

Large volumes mean lots of errors
Source system subject to changes
Errors accumulate
Especially dangerous if triggers activated

Outside: Real-Time Interfaces

Data kept in synchronisation between databases
Data arrives in small packets, out of context
Too fast to validate
Rejecting a record loses it, so bad records are accepted
Faster or better, but not both!

Decay: Changes Not Captured

Object changes are unnoticed by computers
Retroactive changes may not be propagated

Decay: System Upgrades

The data is assumed to comply with the new requirements
Upgrades are tested against what the data is supposed to be, not what is actually there
Once upgrades are implemented everything goes haywire

Decay: New Data Uses

Fitness for the original purpose may not apply to the new use
Previously acceptable error rates may now be an issue
Value granularity, map scale
Data retention policy

Decay: Loss of Expertise

The meaning of codes may change over time in ways only experts know
Experts know when data looks wrong
Retirees rehired to work systems
Example: Auckland address points were entered on corners and the rest guessed, then later used as exact

Decay: Process Automation

Web 2.0 bots automate form filling
Transactions are generated without ever being checked by people
Customers given automated access are more sensitive to errors in their own data

Within: Data Processing

Changes in the programs
Programs may not keep up with changes in data collection
Processing may be done at the wrong time

Special GIS Data Issues

Coordinate data not usually readable
Data models: CAD v GIS
Fuzzy matching is not Boolean (near)
Atomic objects harder to define
Features have 2, 3, 4, 5 dimensions
Projection systems are not exact
Topology requires special operators
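Because coordinates rarely agree exactly, spatial matching usually asks "is it near enough?" rather than "is it equal?". The sketch below illustrates the idea with a distance tolerance. It assumes planar (projected) coordinates in metres; the function name, sample points and the 10 m tolerance are hypothetical.

```python
import math


def nearest_within(point, candidates, tolerance):
    """Return (candidate, distance) for the closest candidate within tolerance, else None.

    Assumes planar coordinates in the same projected system and units.
    """
    best = None
    for cand in candidates:
        d = math.hypot(cand[0] - point[0], cand[1] - point[1])
        if d <= tolerance and (best is None or d < best[1]):
            best = (cand, d)
    return best


# Hypothetical surveyed points (easting, northing) in metres
surveyed = [(1000.0, 2000.0), (1050.0, 2000.0)]
match = nearest_within((1003.0, 2004.0), surveyed, tolerance=10.0)
no_match = nearest_within((1025.0, 2030.0), surveyed, tolerance=10.0)
```

Returning the distance along with the match keeps the "how near" information, which a plain true/false result would discard.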

Within: Data Purging

Highly risky for data quality
Relevant data may be purged
Erroneous data may fit criteria
It may not work the next year

Within: Data Cleaning

En masse processes may add errors
Cleaning processes may have bugs
Incomplete information about data

Assessing Data Quality

Data profiling
Interview users
Examine data model
Data Gazing

Data Gazing

Count the records
Just open the sources and scroll
Sort and look at the ends
Run some simple frequency reports
See if the field names make sense
What is missing that should be there
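Several of these data gazing steps (count the records, sort and look at the ends, run a frequency report) take only a few lines of code. The following is a minimal sketch for one field of an in-memory table; the function name, the `road_type` field and the sample rows are hypothetical.

```python
from collections import Counter


def gaze(rows, field, ends=3):
    """Quick profile of one field: count, blanks, both ends of the sort, top frequencies."""
    values = [row.get(field) for row in rows]
    present = sorted(v for v in values if v not in (None, ""))
    return {
        "count": len(values),
        "blank": len(values) - len(present),
        "lowest": present[:ends],      # start of the sorted values
        "highest": present[-ends:],    # end of the sorted values
        "frequency": Counter(present).most_common(5),
    }


# Hypothetical road records
rows = [
    {"road_type": "RD"},
    {"road_type": "RD"},
    {"road_type": "ST"},
    {"road_type": ""},
    {"road_type": "Rd"},
    {"road_type": "AVE"},
]
report = gaze(rows, "road_type", ends=2)
```

In this sample the frequency report shows both `RD` and `Rd`, the kind of inconsistent coding that scrolling through thousands of rows can easily miss.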

Lunch

Data Cleaning

There are always lots of errors
It is too much to inspect all by hand
Data experts are rare and too busy
It does not fix process errors
You may make it worse

Automated Cleaning

The only practical method
Needs sophisticated pattern analysis
Allow for backtracking
Data quality rules are interdependent
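Backtracking can be as simple as logging every change an automated rule makes so that the rule can be undone if it turns out to be wrong. This is a minimal in-memory sketch of that idea, not a description of any particular cleaning tool; the function names and sample data are hypothetical.

```python
def apply_rule(records, field, fix, log):
    """Apply fix(value) to one field, logging every change so it can be undone."""
    for index, rec in enumerate(records):
        old = rec[field]
        new = fix(old)
        if new != old:
            log.append((index, field, old, new))
            rec[field] = new


def rollback(records, log):
    """Undo logged changes in reverse order and empty the log."""
    while log:
        index, field, old, _new = log.pop()
        records[index][field] = old


# Hypothetical road names with inconsistent capitalisation
roads = [{"name": "queen st"}, {"name": "High St"}]
changes = []
apply_rule(roads, "name", str.title, changes)
cleaned = [r["name"] for r in roads]

rollback(roads, changes)
restored = [r["name"] for r in roads]
```

The log also doubles as an audit trail: it records which rule touched which record, which helps when rules turn out to interact.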

Common Mistakes
1. Inadequate Staffing of Data Quality Teams
2. Hoping That Data Will Get Better by Itself
3. Lack of Data Quality Assessment
4. Narrow Focus
5. Bad Metadata
6. Ignoring Data Quality During Data Conversions
7. Winner-Loser Approach in Data Consolidation
8. Inadequate Monitoring of Data Interfaces
9. Forgetting About Data Decay
10. Poor Organization of Data Quality Metadata

Metadata
Includes everything known about the data

Data model
Business rules, relations, state
Subclasses (lookup tables)
GIS Metadata (NZGLS or ISO)
XML
Readme.txt

Data Exchange

Batch or interactive
ETL (Extract Transform Load)
Replication
Time differences in data

GIS in Business Processes

Integrates many different sources
Spatial patterns are revealed
Display thousands of records simultaneously with direct access
Location now seen as important

Scorecard

Atomic Level Data Quality Information

Intermediate Error Reports
Score Decompositions
Score Summary
DQ Score
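A scorecard can be built up from atomic error records: each failed check on each record is stored once, then rolled up into per-rule scores and a single overall score. The sketch below is one possible reading of that layering, not a prescribed design; the function name, rule names and sample errors are hypothetical.

```python
from collections import defaultdict


def scorecard(record_ids, errors):
    """Roll atomic errors up into per-rule scores and an overall score.

    errors: iterable of (record_id, rule_name) pairs, one per failed check.
    A record counts as bad in the overall score if it fails any rule.
    """
    total = len(record_ids)
    if total == 0:
        raise ValueError("no records to score")

    failed_by_rule = defaultdict(set)
    for record_id, rule in errors:
        failed_by_rule[rule].add(record_id)

    bad_records = set().union(*failed_by_rule.values())
    by_rule = {rule: 1 - len(ids) / total for rule, ids in failed_by_rule.items()}
    return {"overall": 1 - len(bad_records) / total, "by_rule": by_rule}


# Hypothetical: 10 records, 3 atomic errors across 2 rules
ids = list(range(1, 11))
errors = [(2, "valid_postcode"), (5, "valid_postcode"), (5, "has_geometry")]
card = scorecard(ids, errors)
```

Record 5 fails two rules but is counted once in the overall score, so the overall figure is not simply the product or sum of the per-rule scores.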

Case Study

Outline a GIS data quality system
Measles Chart
Prioritise
Interview
Build up a scorecard

Afternoon Tea

Assessment Exercise

Split into pairs
Interview one person about their dataset
Collect basic information
Devise a strategy for a profile
Rotate pair with another
Interview other person
Verbal reports to class

Major Upgrade Projects

LINZ Coordinate upgrade
NSCC Coordinate upgrade

References

Arkady Maydanchik, Data Quality Assessment
