
Beyond Data Glossary 101: From Manual to Automated Discovery

Matthew Lawler

17 Sep 2019

Matthew Lawler lawlermj1@gmail.com Beyond Data Glossary 101


Introduction

This case study walks through how I discovered a corpus of 5,000 words and 1,000 acronyms by parsing 200,000 Data Warehouse (DW) column names. The manually defined acronym list contained 530 acronyms, so this roughly doubled the total. In addition, each Data Glossary term was linked to the schema and column name it occurred in.


Why Listen?

For data professionals, creating a Data (Business) Glossary is an important first step in managing data.

But:

Are you confident that your Data Glossary is complete?

Are your Data Glossary terms used in any database?

Can you map the Data Glossary terms to your database columns, to check for usage gaps?

Can you separate Data Glossary acronyms and words?

Do you maintain the Data Glossary automatically, or are you struggling manually?


Who is this for?

For Staff who need

to understand common terms, especially new or transferred staff.

For Business Analysts who need

to determine if systems support business goals and terms.

to resolve confusion between business areas.

to integrate across business areas.

For Data Modellers who need

to enforce more consistent design rules when generating DDL and SQL.

to improve design and development productivity.

to publish metadata for Business Analysts.

to review business terms against current data models and databases.


Building a Corpus

A corpus is the full set of words used in the enterprise.

This is always specific to the enterprise.

A corpus is mostly built by collecting definitions from Parliamentary Acts, manuals and data dictionaries.

But most words are common and obvious.

The valuable terms are unique terms, homonyms and acronyms.

Acronyms are important as they are shortened forms of common phrases with a shared meaning.

Common words are filtered out, so that only the acronyms and unique words are left.
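
The filtering step can be pictured as a set difference. The sketch below is only an illustration, not the code behind the talk: `COMMON_WORDS` and `filter_corpus` are hypothetical names, and the tiny word list is an assumption standing in for a full dictionary of everyday words.

```python
# Hypothetical stand-in for a full dictionary of everyday English words.
COMMON_WORDS = {"work", "status", "order", "total", "number"}


def filter_corpus(candidates, common_words=COMMON_WORDS):
    """Return the candidates that are not common words (case-insensitive), sorted."""
    return sorted(
        word for word in set(candidates)
        if word.lower() not in common_words
    )


candidates = ["Work", "WIP", "Status", "GUID", "Macaddress", "Order"]
valuable = filter_corpus(candidates)
print(valuable)
```

What remains is the short list of acronyms and enterprise-specific words that still need a definition.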

17 Sep 2019

Matthew Lawler lawlermj1@gmail.com Beyond Data Glossary 101

5

Parsing

A parser is a function that takes text and builds a data structure.

The data structure can be a list of Phrases.

String -> [Phrase]

Work -> [Work]

Works -> [Works]

Workstatus -> [Work, Status]

Workskill -> [Work, Skill]

Worksite -> [Worksite]

Worksites -> [Worksites]
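
One way to read `String -> [Phrase]` is as a greedy longest-match split against a list of known snippets. The talk does not state the algorithm, so treat the following as a sketch under that assumption: `parse_name` and `LEXICON` are hypothetical names, and the lexicon holds only the words from the examples above.

```python
def parse_name(name, lexicon):
    """Split a column name into phrases by greedy longest-prefix matching.

    `lexicon` maps an upper-case snippet to its list of phrases.
    Characters that match no snippet are returned one at a time.
    """
    phrases = []
    for chunk in name.upper().split("_"):
        start = 0
        while start < len(chunk):
            # Try the longest remaining substring first, then shorter ones.
            for end in range(len(chunk), start, -1):
                snippet = chunk[start:end]
                if snippet in lexicon:
                    phrases.extend(lexicon[snippet])
                    start = end
                    break
            else:
                # No snippet starts here: keep the character as unparsed.
                phrases.append(chunk[start])
                start += 1
    return phrases


# Assumed lexicon; multi-word entries carry their own expansion.
LEXICON = {
    "WORK": ["Work"],
    "WORKS": ["Works"],
    "WORKSITE": ["Worksite"],
    "WORKSITES": ["Worksites"],
    "WORKSKILL": ["Work", "Skill"],
    "WORKSTATUS": ["Work", "Status"],
}

print(parse_name("Workstatus", LEXICON))
print(parse_name("WORK_WORKSITES", LEXICON))
```

Under this greedy strategy the multi-word entries matter: without `WORKSTATUS` in the lexicon, the longest match would be `WORKS`, and `TATUS` would be left as unparsed single characters.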

Technical

This is a semi-automated power tool. It grew out of my use of Excel macros and awk scripts to solve this problem.

This is a non-technical talk, so I will avoid any code review. Come to CanFP if you want to see the code.


Data Model

[Diagram: entities Phrase Type, Authority, Domain, Phrase, Snippet 2 Phrase, Phrase 2 Name, Name 2 Phrase and Name, with the labels Input and Output.]

Phrase Type

A phrase type represents the type of word phrase. This can be Acronym, Contraction, Letter, Multiple Words, etc.

Type | Definition | Example
Acronym | Any word formed from the initial letters of a group of words. | WIP for Work In Progress
AllPhrase | The default type for a normal word (AKA Lexeme). | Work
Contraction | Any shortened word with missing letters. | Yr for Year
Letter | A single alphabetic character. | E
MultipleWords | A phrase that consists of more than 1 word. | Workstatus for [Work, Status]
Number | A single numeric character. | 9
PastTense | A phrase in the past tense. | Accrued
Plural | A phrase that denotes more than one. | Activities
ProperNoun | Any name, such as an organisation, system name, etc. | Oracle
Term | Used for multiple word phrases that are almost a single phrase. | Macaddress
ZRubbish | Used for misspellings and non-standard contractions. | Iadc


Domain

A Domain is like a namespace.

There should be no homonyms (words with the same spelling or sound but different meanings) within a Domain. But homonyms will occur across different Domains.

Domain Type | Domain Name
PNI | Physical Network Inventory
IT | Information Technology
HR | Human Resources
Finance | Finance
Engineering | Engineering
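
The no-homonyms rule can be checked mechanically. The sketch below is a minimal illustration with hypothetical names (`find_homonyms`, `entries`); the sample rows are invented for the example and are not the glossary from the talk.

```python
from collections import defaultdict


def find_homonyms(entries):
    """Return {(domain, phrase): expansions} where one domain gives a phrase several meanings."""
    meanings = defaultdict(set)
    for phrase, domain, expansion in entries:
        meanings[(domain, phrase)].add(expansion)
    return {key: sorted(exps) for key, exps in meanings.items() if len(exps) > 1}


# Illustrative glossary rows: (phrase, domain, expansion).
entries = [
    ("CI", "PNI", "Configuration Item"),
    ("CI", "IT", "Continuous Integration"),  # different domain: allowed
    ("NA", "PNI", "Network Analyser"),
    ("NA", "PNI", "Not Applicable"),         # same domain: clash
]
clashes = find_homonyms(entries)
print(clashes)
```

Only the `NA` rows are reported, because the two meanings of `CI` live in different Domains.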


Authority

Authority Type represents the 'Who' of phrases.

That is, which person or organisation has defined this phrase.

This is very useful for defusing definition wars.

Authority | Authority Type | Comment
Internal | Adhoc | Any term used by the organisation without an external authority.
Wiki | Adhoc | Womb of Ignorance, Kraziness and Incomprehension
Oracle | Commercial Organisation |
Kimball | Expert |
AG | Government |
Water Act 2007 | Parliamentary Act |
ANSI | Standards Organisation |


Phrase

A phrase is a single word, or common multiword phrase. A set of phrases is a Corpus of words.

Phrase | Phrase Type | Expansion | Domain
WIP | Acronym | Work In Progress | AllDomains
Work | AllPhrase | | AllDomains
Yr | Contraction | Year | AllDomains
E | Letter | | AllDomains
Workstatus | MultipleWords | [Work, Status] | AllDomains
9 | Number | | AllDomains
Accrued | PastTense | | AllDomains
Works | Plural | | AllDomains
Oracle | ProperNoun | | Organisation
Macaddress | Term | | IT
Iadc | ZRubbish | | ZDomain
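
A Phrase row maps naturally onto a small record type. The sketch below is one possible shape, not the talk's data model: the field names and the `parsed_as` helper are hypothetical, and the rule that only MultipleWords phrases are split into their expansion is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class PhraseType(Enum):
    ACRONYM = "Acronym"
    ALL_PHRASE = "AllPhrase"
    CONTRACTION = "Contraction"
    LETTER = "Letter"
    MULTIPLE_WORDS = "MultipleWords"
    NUMBER = "Number"
    PAST_TENSE = "PastTense"
    PLURAL = "Plural"
    PROPER_NOUN = "ProperNoun"
    TERM = "Term"
    Z_RUBBISH = "ZRubbish"


@dataclass(frozen=True)
class Phrase:
    text: str
    phrase_type: PhraseType
    domain: str = "AllDomains"
    expansion: tuple = ()  # words the phrase stands for, if any

    def parsed_as(self):
        """Assumed rule: MultipleWords phrases split into parts; others stay whole."""
        if self.phrase_type is PhraseType.MULTIPLE_WORDS:
            return list(self.expansion)
        return [self.text]


wip = Phrase("WIP", PhraseType.ACRONYM, expansion=("Work", "In", "Progress"))
workstatus = Phrase("Workstatus", PhraseType.MULTIPLE_WORDS, expansion=("Work", "Status"))
macaddress = Phrase("Macaddress", PhraseType.TERM, domain="IT")
print(workstatus.parsed_as(), wip.parsed_as(), macaddress.domain)
```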


Column Name

The main input is the list of database names, including schema, table and column. These can be extracted using SQL from the metadata tables.

Schema | Table Name | ORD | Column Name
BIA_BA_CAL | A_APPOINTMENT_SUMMARY_T | 1 | REGION_KEY
BIA_BA_CAL | A_MAX_WORK_ORDER_STATUS_HISTORY_V_TABLE | 1 | WORK_ORDER_SK
BIA_BA_CAL | ARR_CONTRACT_VERSION_T | 1 | ARR_CONTRACT_KEY
BIA_BA_CAL | ARR_CONTRACT_VERSION_T | 7 | ROW_NATURAL_ID
BIA_BA_CAL | ARR_CONTRACT_VERSION_T | 8 | EFFECTIVE_FROM_TS
BIA_BA_CAL | ARR_CONTRACT_VERSION_T | 9 | EFFECTIVE_TO_TS
BIA_BA_CAL | ASSR_TASK_T | 1 | TASK_ID
BIA_BA_CAL | ASSR_TASK_T | 10 | INSTANCEID
BIA_BA_CAL | ASSR_TASK_T | 12 | ROOTREQUESTINSTANCEID
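
As an illustration of reading column names from metadata, the sketch below uses Python's built-in `sqlite3` module and SQLite's own catalogue (`sqlite_master` plus the `pragma_table_info` table-valued function). This is an assumption made to keep the example self-contained: the talk does not name the database, and a warehouse would more likely expose a catalogue view such as `INFORMATION_SCHEMA.COLUMNS` or a vendor equivalent. The two toy tables and their column types are also assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy tables named after rows in the table above; column types are assumed.
conn.execute("CREATE TABLE ASSR_TASK_T (TASK_ID INTEGER, INSTANCEID TEXT)")
conn.execute("CREATE TABLE ARR_CONTRACT_VERSION_T (ARR_CONTRACT_KEY INTEGER)")

# pragma_table_info() as a table-valued function needs SQLite 3.16 or later.
# SQLite has no warehouse-style schemas, so 'main' is used as a constant.
QUERY = """
    SELECT 'main'    AS schema_name,
           m.name    AS table_name,
           p.cid + 1 AS ord,
           p.name    AS column_name
    FROM sqlite_master AS m
    JOIN pragma_table_info(m.name) AS p
    WHERE m.type = 'table'
    ORDER BY m.name, p.cid
"""
column_names = conn.execute(QUERY).fetchall()
for row in column_names:
    print(row)
```

Each row has the same four fields as the Column Name input: schema, table name, ordinal position and column name.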


Snippet2Phrase

This is a simple mapping of Phrases to Snippets.

Each Phrase is a key value defined in Phrase In.

Snippets can be upper or lower case, or some mixed case.

Cardinality = O(Phrase In) * 2

Phrase | Snippet
Accrued | ACCRUED
Activities | Activities
Activities | ACTIVITIES
E | E
Macaddress | Macaddress
Macaddress | MACADDRESS
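
The factor of 2 in the cardinality suggests roughly two snippets per phrase. A minimal sketch, assuming the two variants are the phrase as written and its upper-case form (the function names are hypothetical):

```python
def snippets_for(phrase):
    """Return the distinct snippets for a phrase: as written and in upper case."""
    variants = [phrase, phrase.upper()]
    # dict.fromkeys drops duplicates while keeping order, e.g. "E" gives ["E"].
    return list(dict.fromkeys(variants))


def build_snippet2phrase(phrases):
    """Map every snippet back to the phrase it came from."""
    return {
        snippet: phrase
        for phrase in phrases
        for snippet in snippets_for(phrase)
    }


snippet2phrase = build_snippet2phrase(["Activities", "E", "Macaddress"])
print(snippet2phrase)
```

Single-letter phrases such as `E` produce only one snippet, which is why the cardinality is an upper bound rather than exactly double.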


Name2Phrase

For each Name, this shows the phrase list, and any unparsed string.

Cardinality = O(Column Name) (e.g. 200,000)

The table shows examples of both correct parses and false parses.

name2PhraseOutName | name2PhraseOutSnippetsFinal | ? | Note
ACTIONWHENCOMPLETE | [Action,When,Complete] | 0 | No Underscore, but still works
ORDER_TOTAL_ELAPSED_DURATION_HOURS_WH | [Order,Total,Elapsed,Duration,Hours,Wh] | 0 | Underscore separator
EFFORTTRACKINGTOTALTIMESPENTHOURS | [Effort,Tracking,Total,Times,PE,NT,Hours] | 1 | Need to add Timespent
INSTANTIATIONNUMBER | [Inst,Anti,At,IO,N,Number] | 1 | Need to add Instantiation
NUMRETRIES | [Num,Ret,R,IES] | 1 | Need to add retries
PARENTSIGNAL | [Parents,I,G,NA,L] | 1 | Need to add parentsignal
SIMULATION_MESSAGE | [SI,M,UL,At,IO,N,Message] | 1 | Need to add Simulation
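
The talk does not say how the `?` flag is set, and it may well be assigned by manual review. As a purely hypothetical pre-screen, a parse could be flagged when it is fragmented into very short pieces. The rule below (any single-letter phrase, or two or more phrases of at most two letters) is an invented heuristic that happens to agree with the seven rows above; it is not the talk's method.

```python
def looks_like_false_parse(phrases):
    """Heuristic flag: 1 if the parse looks fragmented, else 0."""
    single_letters = sum(1 for p in phrases if len(p) == 1)
    short = sum(1 for p in phrases if len(p) <= 2)
    return 1 if single_letters >= 1 or short >= 2 else 0


examples = {
    "ACTIONWHENCOMPLETE": ["Action", "When", "Complete"],
    "NUMRETRIES": ["Num", "Ret", "R", "IES"],
    "PARENTSIGNAL": ["Parents", "I", "G", "NA", "L"],
}
for name, phrases in examples.items():
    print(name, looks_like_false_parse(phrases))
```

A rule like this would also flag names that legitimately contain a Letter phrase, so it could only shortlist candidates for review.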


Phrase2Name

For each Phrase, this shows all Names used.

Unused phrases are filtered.

Cardinality = O(Phrase In) (e.g. 6,000)

Name | Type | Expansion | Domain | Count | Used In Names | Note
CI | Acronym | Configuration Item | PNI | 2 | [TASK_CI,SERVICECI] | parses without underscore _
GUID | Acronym | Globally Unique ID | IT | 2 | [PHASE_GUID,DETAILSAPPGUID] | parses without underscore _
IES | Acronym | | NBN | 1 | [NUMRETRIES] | False parse
IO | Acronym | Input Output | IT | 2 | [SIMULATION_MESSAGE,INSTANTIATIONNUMBER] | False parse
NA | Acronym | Network Analyser/Not Applicable | PNI | 1 | [PARENTSIGNAL] | False parse
PE | Acronym | | WWM | 2 | [EFFORTTRACKINGTOTALTIMESPENTMINUTES,EFFORTTRACKINGTOTALTIMESPENTHOURS] | False parse
SI | Acronym | | NBN | 1 | [SIMULATION_MESSAGE] | False parse
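
Phrase2Name is the inverse of Name2Phrase, so it can be built by flipping the mapping and counting. A minimal sketch with hypothetical names (`invert`, `name2phrase`), using three of the parses shown earlier as sample input:

```python
from collections import defaultdict


def invert(name2phrase):
    """Turn {name: [phrases]} into {phrase: sorted names that use it}."""
    phrase2name = defaultdict(set)
    for name, phrases in name2phrase.items():
        for phrase in phrases:
            phrase2name[phrase].add(name)
    return {phrase: sorted(names) for phrase, names in phrase2name.items()}


name2phrase = {
    "SIMULATION_MESSAGE": ["SI", "M", "UL", "At", "IO", "N", "Message"],
    "INSTANTIATIONNUMBER": ["Inst", "Anti", "At", "IO", "N", "Number"],
    "NUMRETRIES": ["Num", "Ret", "R", "IES"],
}
phrase2name = invert(name2phrase)
counts = {phrase: len(names) for phrase, names in phrase2name.items()}
print(phrase2name["IO"], counts["IO"])
```

Phrases that never appear in any name are simply absent from the result, which matches the filtering of unused phrases.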


Data Flow Diagram

[Diagram: nodes Snippet 2 Phrase, Name 2 Name, Parse Snippet, Name 2 Phrase, Join Phrase, Invert and Phrase 2 Name, with the labels Input and Output.]


Demo

PhraseIn - 6,000 Phrases

ColumnName - 500 Names


Thanks

This is open source on GitHub at:

DW Dictionary defines an ISO 11179 that could use this code. See:

Future? Extracting words from documents. Grammatical rules + lexemes. NLP - Natural Language Processing.
