
INDEPTH Data Quality Workshop

Program and Curriculum

11-13 May 2010, Accra, Ghana

Course Facilitator : Dr Kobus Herbst


1 Workshop Objectives
1. Create a common understanding of data quality in the context of health
and demographic surveillance
2. Learn from the experience regarding data quality in the iShare initiative
3. Gain practical experience in measuring data quality in HDSS databases
4. Derive and agree on minimum data quality metrics for INDEPTH sites
5. Apply a minimum set of common data quality metrics to own HDSS
database
6. Discuss the form and content of site data quality improvement projects
and INDEPTH’s role in promoting such

1 Outcomes
1. Minimum set of INDEPTH Data Quality Metrics defined
2. Site data quality baselines established
3. Common outline and criteria for site data quality projects agreed to
4. Recommendation made for an INDEPTH Data Quality Assurance Program.

1 Program
Time          Topic                                                                            Presenter

Day 1 : 11 May 2010
8:00-9:00     Registration                                                                     INDEPTH Secretariat
9:00-9:30     Welcome and Introduction to Workshop Objectives                                  INDEPTH Executive, Course Facilitator
9:30-10:30    What is Data Quality?                                                            Course Facilitator
10:30-11:00   Tea Break
11:00-11:30   Impact of Data Quality on Demographic Measures                                   Ayaga Bawah
11:30-12:00   Extent and Implications of Poor Quality Data – iShare Experience                 iShare representative
12:00-12:30   Causes of Poor Quality Data                                                      Course Facilitator
12:30-13:30   Lunch Break
13:30-14:30   Measuring Data Quality : Theory                                                  Course Facilitator
14:30-15:30   Measuring Data Quality : iShare Experience                                       iShare representative
15:30-16:00   Tea Break
16:00-17:00   Measuring Data Quality : Practical – Attribute domain constraints                Course Facilitator

Day 2 : 12 May 2010
8:30-9:30     Measuring Data Quality : Practical – Relational integrity constraints            Course Facilitator
9:30-10:30    Measuring Data Quality : Practical – Historical Data & State-Dependent Objects   Course Facilitator
10:30-11:00   Tea Break
11:00-11:30   Measuring Data Quality : Practical – General Attribute Dependencies              Course Facilitator
11:30-13:00   Discussion : Agreeing on a minimum set of data quality metrics for INDEPTH       All Participants
13:00-14:00   Lunch Break
14:00-17:00   Applying agreed set of data quality metrics to own database                      All Participants

Day 3 : 13 May 2010
8:30-10:00    Comparison & Standardisation of Minimum Data Quality Metrics                     Course Facilitator
10:00-10:30   Tea Break
10:30-11:00   Total Data Quality Management : Theory                                           Course Facilitator
11:00-12:30   Discussion : Data Quality Assurance in INDEPTH : The Way Forward                 All Participants
12:30-13:00   Publication : Workshop Proceedings                                               Course Facilitator
13:00-14:00   Lunch
14:00-16:00   INDEPTH Minimum Dataset                                                          INDEPTH Secretariat

2 Curriculum
2.1 What is Data Quality?
2.1.1 Learning Objectives
1. Explain the different roles that can be identified in the information
production system
2. Understand the concept of an information product, and relate that to the
HDSS research context
3. Understand and explain the different concepts of data quality
4. Identify the dimensions of data quality most relevant to HDSS

2.1.2 Content
1. Information System Roles
2. Information Products
3. Concepts & Dimensions of Data Quality

2.1.3 Pre-reading and Reference Material


1. Carlo Batini, Monica Scannapieco. Data Quality: Concepts, Methodologies
and Techniques. 2006. Springer, Berlin. Pp 1-49.
2. Jack E. Olson. Data Quality: The Accuracy Dimension. 2003. Morgan
Kaufmann, San Francisco. Pp 3-64.
3. Census Bureau Methodology & Standards Council. Census Bureau
Principle: Definition of Data Quality. 2006. US Census Bureau.
4. Danette McGilvray. Executing Data Quality Projects: Ten Steps to Quality
Data and Trusted Information. 2008. Morgan Kaufmann, Burlington. Pp 30-33.
5. Tim Holt, Tim Jones. Quality work and conflicting quality objectives. 1998.
84th DGINS conference, Stockholm 28-29 May 1998. Office for National
Statistics, UK.

2.2 Impact of Data Quality on Demographic Measures
2.2.1 Learning Objectives
To be provided

2.2.2 Content
To be provided

2.2.3 Pre-reading and Reference Material
To be provided.

2.3 Extent and Implications of Poor Quality Data – iShare Experience
2.3.1 Learning Objectives
To be provided

2.3.2 Content
To be provided

2.3.3 Pre-reading and Reference Material
To be provided.

2.4 Causes of Poor Quality Data
2.4.1 Learning Objectives
1. Able to classify and describe the causes of poor data quality

2.4.2 Content
1. Research Design
a. Research Question
b. Research Methodology
c. Data System Design
2. Population Factors
a. Education
b. Cultural
3. Data Collection
a. Field workers
b. Data collection instruments
c. Data Entry
4. Data Analysis
a. Data Conversion
b. Data Extraction
c. Data Cleaning

2.4.3 Pre-reading and Reference Material


1. Van den Broeck, J., S.A. Cunningham, R. Eeckels, and K. Herbst, Data
cleaning: detecting, diagnosing, and editing data abnormalities. PLoS
Med, 2005. 2(10): p. e267.

2.5 Measuring Data Quality
2.5.1 Learning Objectives
1. Classify, list and explain the different rules that can be applied to measure
data quality

2.5.2 Content
1. Data Quality Rules
a. Attribute domain constraints
b. Relational integrity constraints
c. Rules for historical data
d. Rules for state-dependent objects
e. General attribute dependency rules

2.5.3 Pre-reading and Reference Material


1. Leo L. Pipino, Yang W. Lee, and Richard Y. Wang. Data Quality Assessment.
Communications of the ACM. April 2002, Vol. 45, No. 4. p 211.
2. Arkady Maydanchik. Data Quality Assessment. 2007. Technics Publications.

2.6 Measuring Data Quality : Practical
2.6.1 Learning Objectives
1. Apply Data Quality Rules to the DSS Reference Data Model to derive data
quality indicators

2.6.2 Content
The examples are all based on a sample database built on the INDEPTH
Reference Data Model (see Appendix A). The SQL used to derive the data
quality indicators is contained in Appendix B. The SQL dialect is SQL Server
2008 T-SQL.

1. Attribute domain constraints


a. Optionality Constraints

These constraints prevent attributes from taking Null, or missing, values. Default
values are often entered to circumvent Not-Null constraints, i.e., the attribute is
populated with a default value when the actual value is not available. Such default
values therefore have to be counted together with Nulls when measuring completeness.

Example: Cause of Death codes

Cause            n
Unassigned     520
Null           745
Assigned      8225
Total         9490
Indicator    86.7%

Indicator = 1 - (Unassigned + Null) / Total
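
The indicator can be recomputed from the counts in the table. The following is a minimal Python sketch and not part of the workshop's T-SQL material; the function name `optionality_indicator` is an illustrative assumption.

```python
def optionality_indicator(assigned: int, unassigned: int, null: int) -> float:
    """Share of records carrying a real value: 1 - (Unassigned + Null) / Total."""
    total = assigned + unassigned + null
    if total == 0:
        raise ValueError("no records to assess")
    return 1 - (unassigned + null) / total


# Counts taken from the Cause of Death example above
indicator = optionality_indicator(assigned=8225, unassigned=520, null=745)
print(f"{indicator:.1%}")
```

Treating the "Unassigned" default code like a Null is the point of the example: both count against completeness.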

b. Format Constraints

These constraints define the expected form in which the attribute values are stored in
the database field. Format constraints are most important when dealing with “legacy”
databases. However, even modern databases are full of surprises. From time to time,
numeric and date/time attributes are still stored in text fields.

Example : Surname field containing invalid characters.

Use wildcard characters or regular expressions to detect format violations. The function
to use depends on the particular database. In SQL Server 2008 T-SQL, the PATINDEX
function can be used to find any LastName containing a character outside the set of
upper and lower case letters, the space and the single quote (') character.

SELECT
COUNT(*)
FROM dbo.Individuals
WHERE PATINDEX('%[^a-zA-Z '']%',LastName)>0

LastName          n
Valid        126275
Invalid         137
Total        126412
Indicator     99.9%

Indicator = 1 - Invalid / Total
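
To prototype the same check outside the database, a rough Python equivalent of the PATINDEX test is sketched below. The sample names and both function names are hypothetical. T-SQL evaluates character ranges under the column's collation, so the two implementations may classify accented letters differently.

```python
import re

# Same character class as the PATINDEX pattern: anything that is NOT
# a-z, A-Z, a space or a single quote counts as a format violation.
INVALID_CHAR = re.compile(r"[^a-zA-Z ']")


def has_invalid_chars(last_name: str) -> bool:
    return INVALID_CHAR.search(last_name) is not None


def format_indicator(last_names) -> float:
    """Indicator = 1 - Invalid / Total for a collection of surnames."""
    names = list(last_names)
    if not names:
        raise ValueError("no names to assess")
    invalid = sum(has_invalid_chars(name) for name in names)
    return 1 - invalid / len(names)


sample = ["Nguyen", "O'Brien", "van Wyk", "Smith2", "García"]  # made-up data
flagged = [name for name in sample if has_invalid_chars(name)]
print(flagged, f"{format_indicator(sample):.0%}")
```

In this sketch an accented surname is flagged, which shows why the permitted character set should be agreed per site before the indicator is compared across sites.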

c. Valid Value Constraints

These constraints limit the permitted attribute values to a prescribed list or range.
Unfortunately, valid value lists are often unavailable, incomplete, or incorrect. To
identify valid values, we first need to collect counts of all actual values. These counts
can then be analyzed, and actual values can be cross-referenced against the valid
value list, if available. Values that are found in many records are probably valid, even if
they are missing from the data dictionary. Such circumstances typically arise when
new values are added after the original database design, but are not added to the
documentation. Values that have low frequency are suspect.

Example : Residency episode initiating event type.

A residency episode should only be started by DSS start, birth or in-migration.

Start Type        n
Valid        168544
Invalid        2109
Total        170653
Indicator     98.8%

Indicator = 1 - Invalid / Total
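
The profiling step described above (count every actual value, then cross-reference against the valid list) can be sketched in Python as follows. The event codes are borrowed from Appendix B, but which of them make up the valid list is an assumption made here for illustration, and the data are made up.

```python
from collections import Counter

# Assumed valid list: DSS start, delivery (birth) and in-migration codes.
VALID_START_TYPES = {"DSS", "DLV", "INM", "INT"}


def profile_values(values, valid):
    """Return (frequency of each value, invalid values with counts, indicator)."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no values to profile")
    invalid = {value: n for value, n in counts.items() if value not in valid}
    indicator = 1 - sum(invalid.values()) / total
    return counts, invalid, indicator


start_types = ["DSS"] * 6 + ["DLV"] * 2 + ["INM"] + ["HMS"]  # made-up data
counts, invalid, indicator = profile_values(start_types, VALID_START_TYPES)
print(counts.most_common(), invalid, f"{indicator:.0%}")
```

Listing the frequencies before applying the valid list keeps high-frequency values that are missing from the data dictionary visible, rather than silently counting them as errors.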

Example : Birth Weights

d. Precision and Granularity Constraints

These constraints require all values of an attribute to have the same precision,
granularity, and unit of measurement. Precision constraints can apply to both numeric
and date/time attributes. For numeric values, they define the desired number of
decimals. For date/time attributes, precision can be defined as calendar month, day,
hour, minute, or second. Data profiling can be used to calculate the distribution of
values for each precision level.

Example : Date of Birth Precision

Date Precision   Precision        n     Score
Day                      1    15765    141885
Week                     2       66       528
Fortnight                3        2        14
Month                    4      519      3114
Quarter                  5       11        55
Semester                 6       67       268
Year                     7        0         0
Decade                   8        0         0
Unknown                  9        0         0
Total                         16430    145864
ScoreMax                               147870
Indicator                               98.6%

Score      = (10 - Precision) × Frequency
ScoreMax   = 9 × (sum of Frequency over all precision levels)
ScoreTotal = sum of Score over all precision levels
Indicator  = 1 - (ScoreMax - ScoreTotal) / ScoreMax
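
The scoring scheme can be checked against the table with a few lines of Python. This is a sketch only; the function name `precision_score` is an illustrative assumption, while the weights follow the formulas given with the table.

```python
def precision_score(freq_by_precision):
    """freq_by_precision maps a precision code (1 = day ... 9 = unknown) to a count."""
    score_total = sum((10 - code) * n for code, n in freq_by_precision.items())
    score_max = 9 * sum(freq_by_precision.values())  # every date known to the day
    if score_max == 0:
        raise ValueError("no dates to assess")
    indicator = 1 - (score_max - score_total) / score_max
    return score_total, score_max, indicator


# Frequencies from the Date of Birth Precision example
dob_precision = {1: 15765, 2: 66, 3: 2, 4: 519, 5: 11, 6: 67, 7: 0, 8: 0, 9: 0}
score_total, score_max, indicator = precision_score(dob_precision)
print(score_total, score_max, f"{indicator:.1%}")
```

Because a day-precise date scores 9 and an unknown date scores 1, the indicator never falls to zero; it measures distance from the best achievable score.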

Example : Migration Date Precision

Indicator Value
External In-Migration Date 77.7%
External Out-Migration Date 77.4%
Internal In-Migration Date 79.0%
Internal Out-Migration Date 78.1%

2. Relational integrity constraints


a. Identity rules
An identity rule validates that every record in a database table corresponds to one and
only one real world entity and that no two records reference the same entity.

Example : Potential Individual duplications

Similarity measure = Levenshtein distance [1] (FirstName_a, FirstName_b)
                   + Levenshtein distance (LastName_a, LastName_b)
                   + (Sex_a = Sex_b ? 0 : 1)
                   + ABS(YEAR(DoB_a) - YEAR(DoB_b))
                   + ABS(MONTH(DoB_a) - MONTH(DoB_b))
                   + ABS(DAY(DoB_a) - DAY(DoB_b))

[1] The Levenshtein distance between two strings is defined as the minimum number of edits
needed to transform one string into the other, with the allowable edit operations being
insertion, deletion, or substitution of a single character.

Similarity        n
0                42
1               238
2               442
3               820
4              1699
5              3832
6              8349
7             16849
8             31836
9             59679
10           110038

Indicator = 1 - (Individuals - UniqueIndividuals) / Individuals = 99.4%
Similarity = 1
IndA   IndB    Name A               Name B               Sex A   Sex B   DoB A        DoB B
461    43169   Nesbit, Nqobile      Nesbit, Nqobile      FEM     FEM     1992/09/08   1993/09/08
1001   1005    Nguyen, Simbongiwe   Nguyen, Sibongiwe    FEM     FEM     1995/03/25   1995/03/25

Similarity = 2
IndA   IndB    Name A               Name B               Sex A   Sex B   DoB A        DoB B
1388   18938   Mitchell, Hlengiwe   Mitchell, Hlengiwe   FEM     FEM     1983/05/16   1983/04/15
3378   3380    Myers, Sandile       Myers, Zandile       MAL     FEM     1987/11/25   1987/11/25

Similarity = 3
IndA   IndB    Name A               Name B               Sex A   Sex B   DoB A        DoB B
84     85      Johnson, Ntando      Johnson, Nontando    MAL     FEM     1983/12/03   1983/12/03
255    260     Sosibo, Thandiwe     Sosibo, Thandeka     FEM     FEM     1976/05/07   1976/05/07
569    12191   Smith, Bongani       Smith, Lindani       MAL     MAL     1994/08/14   1994/08/14
585    35418   García, Sanele       García, Zanele       MAL     FEM     1997/12/06   1996/12/06
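
The similarity measure can be reproduced with a short Python sketch. The record layout (a dict with `first`, `last`, `sex` and `dob` keys) is an assumption made for illustration. The Levenshtein function is a standard dynamic-programming implementation and not necessarily the one used to produce the tables above.

```python
from datetime import date


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: dict, b: dict) -> int:
    """Lower values mean more alike; 0 is an exact match on all compared fields."""
    return (levenshtein(a["first"], b["first"])
            + levenshtein(a["last"], b["last"])
            + (0 if a["sex"] == b["sex"] else 1)
            + abs(a["dob"].year - b["dob"].year)
            + abs(a["dob"].month - b["dob"].month)
            + abs(a["dob"].day - b["dob"].day))


ind_a = {"first": "Sanele", "last": "García", "sex": "MAL", "dob": date(1997, 12, 6)}
ind_b = {"first": "Zanele", "last": "García", "sex": "FEM", "dob": date(1996, 12, 6)}
print(similarity(ind_a, ind_b))
```

Comparing every pair of individuals grows quadratically with the population, so in practice the comparison is usually restricted to candidate pairs, for example those sharing a surname or year of birth.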

b. Reference rules
A reference rule ensures that every reference made from one entity occurrence to
another entity occurrence can be successfully resolved. Each reference rule is
represented in relational data models by a foreign key that ties an attribute or a
collection of attributes of one entity with the primary key of another entity. Foreign
keys guarantee that navigation of a reference across entities does not result in a “dead
end.”

Example : Child to Parent references.

Status          Mother    Father
Known           74,043    32,257
Missing          2,855     9,708
Unknown         49,514    84,447
Total          126,412   126,412
Indicator A      97.7%     92.3%
Indicator B      58.6%     25.5%

Indicator A = 1 - Missing / Total
Indicator B = 1 - (Missing + Unknown) / Total

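c. Cardinal rules

A cardinal rule defines the constraints on relationship cardinality. Cardinal rules are not
to be confused with reference rules. Whereas reference rules are concerned with the
identity of the occurrences in referenced entities, cardinal rules define the allowed
number of such occurrences.

Example : Every individual should have at least one residency episode.

Residency     Wrong    Correct
Exists       170653     124657
None           1755       1755
Total        172408     126412
Indicator     99.0%      98.6%

Indicator = 1 - None / Total

The “Wrong” column counts residency episodes rather than individuals, which inflates the
denominator; the “Correct” column counts each individual once.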

d. Inheritance rules
An inheritance rule expresses integrity constraints on entities that are associated
through generalization and specialization, or more technically through sub-typing.

Example : Not available.

3. Rules for historical data


a. Currency Rule

A currency rule enforces the desired “freshness” of the historical data. Currency rules
are usually expressed in the form of constraints on the effective date of the most
recent record in the history. For example if the status of an individual under
surveillance is 'Current', then the last visit date should be no earlier than the start of
the previous surveillance round.

Example 1 : The last observation for a current residency episode must fall no earlier
than the previous census round.

Currency      Residency Episodes
Current                    62621
Not Current                 2384
Total                      65005
Indicator                  96.3%

Indicator = 1 - NotCurrent / Total

Example 2 : At year end, the last status observation should not be earlier than 1 July of
that year (i.e. not older than 183 days).
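
A currency check of the kind described in Example 2 amounts to a date comparison. The Python sketch below assumes a reference date of 31 December and a maximum age of 183 days; the function name `is_current` is illustrative.

```python
from datetime import date, timedelta


def is_current(last_observation: date, reference: date, max_age_days: int = 183) -> bool:
    """True if the last observation is no older than max_age_days at the reference date."""
    return reference - last_observation <= timedelta(days=max_age_days)


year_end = date(2009, 12, 31)
print(is_current(date(2009, 7, 1), year_end))   # observed on 1 July
print(is_current(date(2009, 6, 30), year_end))  # observed one day earlier
```

The currency indicator is then one minus the share of records for which the check fails.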

b. Retention Rule

A retention rule enforces the desired depth of the historical data. Retention rules are
usually expressed in the form of constraints on the overall duration or the number of
records in the history.

c. Granularity rule

A granularity rule requires all measurement periods in an accumulator history to have
the same size.

E.g. if the surveillance design implies a six-monthly visit to each homestead, is that in
fact the case?

d. Continuity rule

A continuity rule prohibits gaps and overlaps in accumulator histories. Continuity rules
require that the beginning date of each measurement period immediately follows the
end date of the previous period.

For example for internal migrations, the next residency episode must follow directly on
the previous.

Example : Internal migrations

Continuity           n
Continuity      18,657
Discontinuity    6,430
Total           25,087
Indicator        74.4%

Indicator = 1 - Discontinuity / Total
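
Gaps and overlaps between consecutive episodes can be detected by comparing each episode's start with the previous episode's end. The Python sketch below works under stated assumptions: episodes are (start, end) date pairs for one individual, and a start on the same day as, or the day after, the previous end counts as continuous. That tolerance and the function names are illustrative choices, not the workshop's definition.

```python
from datetime import date, timedelta


def classify_link(prev_end: date, next_start: date, max_gap_days: int = 1) -> str:
    """Compare one episode's end with the next episode's start."""
    if next_start < prev_end:
        return "overlap"
    if next_start - prev_end > timedelta(days=max_gap_days):
        return "gap"
    return "continuous"


def continuity_indicator(episodes):
    """episodes: list of (start, end) tuples belonging to ONE individual."""
    ordered = sorted(episodes)
    links = [classify_link(a[1], b[0]) for a, b in zip(ordered, ordered[1:])]
    if not links:
        return links, 1.0  # a single episode cannot be discontinuous
    bad = sum(link != "continuous" for link in links)
    return links, 1 - bad / len(links)


episodes = [  # made-up data
    (date(2005, 1, 1), date(2006, 3, 15)),
    (date(2006, 3, 15), date(2007, 6, 30)),  # starts the day the previous one ended
    (date(2007, 9, 1), date(2008, 2, 1)),    # two-month gap
    (date(2008, 1, 20), date(2009, 1, 1)),   # overlaps the previous episode
]
links, indicator = continuity_indicator(episodes)
print(links, f"{indicator:.1%}")
```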

e. Timestamp pattern rule

A timestamp pattern rule requires all timestamps to fall into a certain prescribed date
interval, such as every March or every other Wednesday or between the first and fifth
of each month. Occasionally the pattern takes the form of minimum or maximum
length of time between measurements. For example, participants in a medical study
may be required to take blood pressure readings at least once a week. While the
length of time between particular measurements will differ, it has to be no longer than
seven days.

Example : Similar to granularity rule, homestead has to be visited at least once every
six months.

Semester Visits   Locations
Visited             194,238
Not Visited          19,488
Total               213,726
Indicator             90.9%

Indicator = 1 - NotVisited / Total

Note : Care should be taken with the type of observations used to derive this measure.
If, for example, only observations tied to residency and status are considered, locations
that were visited but where nothing was recorded because the occupants could not be
contacted will not be counted as visited by this indicator.

f. Value Pattern Rule

Value histories for time-dependent attributes usually also follow systematic patterns. A
value pattern rule utilizes these patterns to predict reasonable ranges of values for
each measurement and identify likely outliers. Value pattern rules can restrict
direction, magnitude, or volatility of change in data values.

i. Direction of Change

The simplest value pattern rules restrict the direction of value changes from
measurement to measurement. A person's height is unlikely to decrease across
successive measurements, and the same holds for educational attainment.

Example: Educational attainment cannot decline.

Direction     Measures
Invalid         21,178
Valid          128,613
Total          149,791
Indicator        85.9%

Indicator = 1 - Invalid / Total

ii. Magnitude of Change

A magnitude-of-change rule is usually expressed as a maximum (and occasionally
minimum) allowed change per unit of time.

Example : Educational attainment cannot increase by more than the difference in years
between two observation dates.

Direction             Measures
Valid                  117,626
Invalid Direction       21,181
Invalid Magnitude       10,984
Total                  149,791
Indicator                78.5%

Indicator = 1 - (InvalidDirection + InvalidMagnitude) / Total
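
Both value pattern rules can be applied in a single pass over an individual's observation history. The Python sketch below assumes that educational attainment is recorded as completed years of education and that "difference in years" means the difference between calendar years; both are assumptions, as are the function name and the sample data.

```python
from datetime import date


def classify_changes(observations):
    """observations: list of (observation_date, years_of_education) for ONE individual."""
    ordered = sorted(observations)
    results = []
    for (d0, e0), (d1, e1) in zip(ordered, ordered[1:]):
        if e1 < e0:
            results.append("invalid direction")   # attainment cannot decline
        elif e1 - e0 > d1.year - d0.year:
            results.append("invalid magnitude")   # gained more years than elapsed
        else:
            results.append("valid")
    return results


history = [  # made-up data
    (date(2004, 3, 1), 7),
    (date(2005, 3, 10), 8),
    (date(2006, 2, 20), 6),
    (date(2007, 3, 5), 10),
]
results = classify_changes(history)
indicator = 1 - sum(r != "valid" for r in results) / len(results)
print(results, f"{indicator:.1%}")
```

Direction is tested first, so each pair of observations is counted under one violation type only, which matches the mutually exclusive rows of the table.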

g. Event History rules


i. Event Dependencies

Various events often affect the same objects and therefore may be
interdependent. Data quality rules can use these dependencies to validate the
event histories. E.g. an out-migration event cannot be recorded for an
individual without a prior birth or in-migration event.

Example : Outmigration events cannot be preceded by ‘Death’, ‘Visit’ or
‘Outmigration’ events.

Dependency        n
Correct      56,817
Incorrect         3
Total        56,820
Indicator    99.99%

Indicator = 1 - Incorrect / Total
ii. Event Conditions

Events of many kinds do not occur at random but rather only happen under
certain unique circumstances. Event conditions verify these circumstances.

Example: Birth spacing, the time between two subsequent pregnancies with a
live birth outcome should not be less than 9 months (280 days).

Birth Spacing   Pregnancies
Too Short               542
Valid                40,935
Total                41,477
Indicator             98.7%

Indicator = 1 - TooShort / Total
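
The birth spacing condition can be sketched in Python as a comparison of consecutive pregnancy end dates for one woman. The 280-day threshold is the one used in the example; the function name and the dates are made up for illustration.

```python
from datetime import date

MIN_SPACING_DAYS = 280  # threshold quoted in the example above


def short_intervals(pregnancy_end_dates):
    """Return consecutive (earlier, later) live-birth pregnancies that are too close."""
    ordered = sorted(pregnancy_end_dates)
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if (b - a).days < MIN_SPACING_DAYS]


ends = [date(2001, 1, 10), date(2001, 8, 1), date(2003, 5, 20)]  # made-up data
flagged = short_intervals(ends)
print(flagged)
```

The input is one date per pregnancy, not per child, so twins from the same pregnancy do not produce a zero-day interval.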

iii. Event-specific Attribute Constraints

Events themselves are often complex entities, each with numerous attributes.

Example: A pregnancy outcome event requires the mother to be of child-bearing age.

Mother's Age   Pregnancies
Valid               70,601
Invalid              1,285
Total               71,886
Indicator            98.2%

4. Rules for state-dependent objects

These rules place constraints on the lifecycle of objects described by so-called
state-transition models.

State-dependent objects go through a sequence of states in the course of their life
cycle as a result of various events. Data for state-dependent objects is very
common in real-world databases and is also the most error-prone. Various data quality
rules can be implemented to validate such data. Some of these rules are rather simple,
while others can be quite complex and vary significantly depending on the data
structure. In all cases, data quality rules for state-dependent objects are key to
successful data quality assessment, since data for such objects is typically very
important and yet contains numerous "hidden" errors.

[Figure: state-transition model for an individual under surveillance.
States: Not under surveillance; Under surveillance (known location); Under surveillance
(unknown location); Dead.
Actions: Census, Visit, Birth, Inmigration, Outmigration, Internal Outmigration,
Internal Inmigration, Death.]

a. State domain constraint

A state domain constraint limits the set of allowed states to only those shown in the
state-transition model. Invalid states are usually typos inside otherwise valid records.
The true state can often be deduced based on the action value.

b. Action domain constraint

An action domain constraint limits the set of allowed actions to only those shown in the
state-transition model. Invalid actions are usually typos inside otherwise valid records.
The true action can often be deduced based on the state value.

c. Terminator domain constraint

A terminator domain constraint limits the set of allowed terminators, specifically states
in which an object can start and end its life cycle. Invalid terminators often are a
symptom of missing records at the beginning of the life cycle.

Example : Invalid states at first transition

To State    Action         n
INV         HMS        1,838
INV         INM           51
INV         INT        9,205
SLK         DLV       16,430
SLK         DSS       62,633
SLK         INM       34,500
Total                124,657
Indicator              91.1%

d. State-transition constraints

These constraints limit state changes to those allowed by the state-transition model.
For example, a person who is already out-migrated cannot be out-migrated again
without being in-migrated in between. Invalid state-transitions often signify a missing
action.

Example : Residency state transitions

Final State   Individuals
Invalid            16,409
Valid             108,248
Total             124,657
Indicator           86.8%

Indicator = 1 - InvalidEndState / Individuals

State         Transitions
Invalid            16,409
Valid             296,501
Total             312,910
Indicator           94.8%

Indicator = 1 - InvalidTransition / Transitions

Invalid Transition Causes

Invalid Reason                                                 Action        n   % (per reason)
Action disallowed if not under surveillance                    INT      11,329   69.0%
Invalid action                                                 HDS       1,161   19.8%
                                                               HMS       2,088
Action cannot start a residency if at unknown location         INM       1,620    9.9%
Temporal integrity violated                                    HDS           3    0.9%
                                                               HMS           1
                                                               INT          78
                                                               OTM          64
Action condition violated                                      INM          55    0.3%
Action cannot start a residency if already at known location   INM           4    0.1%
                                                               INT           6
Total                                                                   16,409
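
State-transition constraints can be checked by replaying each individual's actions against a table of allowed transitions. The Python sketch below uses a deliberately simplified, hypothetical transition table. The codes SLK, DSS, DLV, INM, INT and OTM appear in this document; NOT, SUL, DEAD, IOM and DTH are invented here for illustration, and the real model should be taken from the state-transition diagram.

```python
# (current state, action) -> next state; hypothetical, simplified model
ALLOWED = {
    ("NOT", "DSS"): "SLK",   # enumerated at start of surveillance
    ("NOT", "DLV"): "SLK",   # born into the surveillance area
    ("NOT", "INM"): "SLK",   # external in-migration
    ("SLK", "OTM"): "NOT",   # external out-migration
    ("SLK", "IOM"): "SUL",   # internal out-migration, destination not yet known
    ("SUL", "INT"): "SLK",   # internal in-migration at the new location
    ("SLK", "DTH"): "DEAD",  # death
}


def replay(actions, start="NOT"):
    """Return (final_state, invalid), where invalid lists (position, state, action)."""
    state, invalid = start, []
    for position, action in enumerate(actions):
        next_state = ALLOWED.get((state, action))
        if next_state is None:
            invalid.append((position, state, action))  # keep state and carry on
        else:
            state = next_state
    return state, invalid


final_state, invalid = replay(["INM", "OTM", "OTM", "INM", "DTH"])
print(final_state, invalid)
```

In this sketch the second out-migration is rejected because the individual is no longer under surveillance, which mirrors the example in the text. Counting rejected actions over all actions gives a transition-level indicator.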

e. State-action constraints

State-action constraints require that each action is consistent with the change in the
object state. For example, after an out-migration, the state of an individual must be
non-resident.

f. Continuity rules

Prohibit gaps and overlaps in state-transition history. In other words, they require that
the effective date of each state record must immediately follow the end date of the
previous state record.

Example : See 3.d. Historical data, continuity rule

g. Duration rules

Put a constraint on the maximum and/or minimum length of time an object can stay in
any specific state. The simplest form of the duration rule is the zero-length rule, which
requires the length of time spent in each state to be greater than zero.

Example : Residency episode duration cannot be negative (end before start) or zero.
Total 170,653
Indicator 99.8%

h. Action pre-conditions

Action pre-conditions are the conditions that must be satisfied before an action can
take place. E.g. the mother must be resident for a child to start residency by birth.

Example : Mother’s state at child residency start, if child starts residency with
delivery.

Mothers               Children
Mother resident         14,696
Mother non-resident      1,619
Mother unknown             131
Total                   16,446
Indicator                89.4%

Indicator = 1 - (MotherNonResident + MotherUnknown) / Children

i. Action post-conditions

These are the conditions that must be satisfied after the action is successfully
completed.

5. General attribute dependency rules

Rules that describe complex attribute relationships, including constraints on
redundant, derived, partially dependent, and correlated attributes.

a. Redundant attributes

Redundant attributes are data elements that represent the same attribute of a real
world object. While attribute redundancy goes against basic data modelling principles,
it is common in practice for several reasons. First, redundancy is widespread in
“legacy” databases and certain systems that were converted from the “legacy”
databases. Secondly, redundancy is often used even in modern relational databases to
improve efficiency of data access, information presentation, and transaction
processing. Finally, some data across different systems are invariably redundant.
Comparison of redundant attributes is a sure way to identify (and eventually correct)
numerous data problems.

Example: The link between mother and child is explicit via MotherID and implicit via
births and pregnancies; the two should be consistent.

Link           Pairs
Linked         15746
Not linked         1
Total          15747
Indicator     99.99%

Indicator = 1 - NotLinked / Pairs

Of the cases where the residency start is Birth and it is linked to a Pregnancy, in one
case the link between child and mother was not reflected in the MotherID of the child.

The converse is slightly more complex. Of the children born to the mother while she
was resident, are all such children recorded as resident and the residency start marked
as Birth? Whether this test is absolute will depend on the eligibility rules of the HDSS.

Link                 Pairs   Description
Birth not linked      1750   Child resident by birth is not linked to a resident mother via a pregnancy.
Birth not resident    1102   Resident mother gave birth to a child that is not resident from birth.
Consistent           14696
Total                17548
Indicator            83.7%

Indicator = 1 - (BirthsNotLinked + BirthsNotResident) / Pairs

b. Derived Attributes

Values of derived attributes are calculated based on the values of some other
attributes. This approach is very common when the calculation is rather complex and
involves data stored in multiple records of possibly multiple entities. Performing the
calculation on the fly is then very inefficient. One of the most common special cases of
derived attribute constraints is a balancing rule, which requires an aggregate attribute
to equal the total of atomic level attribute values.

Example : Data should satisfy the demographic equation:

Population(t+1) = Population(t) + Births(t) - Deaths(t) + (Immigration(t) - Emigration(t))

Component Y2000 Y2001 Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
Population 0 66035 68027 67277 65785 64981 64916 65376 64289 63494
Start observation 62633 0 0 0 0 0 0 0 0 0
Births 1675 1723 1719 1641 1743 1748 1749 1692 1579 1101
Immigration 3887 5689 7033 6032 5111 5348 5511 5242 5279 4992
Deaths 886 1077 1098 1129 983 979 886 913 796 697
Emigration 1923 4857 8194 7229 6204 6038 5760 6171 5702 5427
Population t+1 65386 67513 67487 66592 65452 65060 65530 65226 64649 63463
Balance -649 -514 210 807 471 144 154 937 1155
Indicator 99.0% 99.2% 99.7% 98.8% 99.3% 99.8% 99.8% 98.6% 98.2%
Provision made for contextual factors such as change in HDSS boundary and loss to
follow-up:

Component Y2000 Y2001 Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009

Population 0 66033 68025 67266 65654 64263 63436 65376 64289 63494
Start observation 62633 0 0 0 0 0 2239 0 0 0
Births 1675 1723 1719 1640 1729 1696 1685 1692 1579 1101
Immigration 3885 5689 7027 5974 4940 5069 5172 5242 5279 4992
Deaths 886 1077 1098 1129 983 979 886 913 796 697
Emigration 1923 4857 8194 7229 6204 6038 5760 6171 5702 5427
Loss to Follow-up 56 50 96 78 93 67 204 538 635 36233
Population t+1 65328 67461 67383 66444 65043 63944 65682 64688 64014 27230
Balance -705 -564 117 790 780 508 306 399 520
Indicator 98.9% 99.2% 99.8% 98.8% 98.8% 99.2% 99.5% 99.4% 99.2%
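
The balancing rule can be replayed in Python using the first three years of the first table. The Balance row is reproduced as projected Population t+1 minus the population recorded at the start of the following year. The denominator of the indicator is not stated in the text; dividing by the projected population is an inference that matches the published percentages. All function and key names are illustrative.

```python
def project_population(pop, start_obs, births, immigration, deaths, emigration):
    """Demographic balancing equation for one year."""
    return pop + start_obs + births + immigration - deaths - emigration


years = {  # values from the first table above
    2000: dict(pop=0, start_obs=62633, births=1675, immigration=3887,
               deaths=886, emigration=1923),
    2001: dict(pop=66035, start_obs=0, births=1723, immigration=5689,
               deaths=1077, emigration=4857),
    2002: dict(pop=68027, start_obs=0, births=1719, immigration=7033,
               deaths=1098, emigration=8194),
}


def balance_report(years):
    """Map each year to (projected Population t+1, balance, indicator)."""
    report = {}
    ordered = sorted(years)
    for this_year, next_year in zip(ordered, ordered[1:]):
        projected = project_population(**years[this_year])
        balance = projected - years[next_year]["pop"]
        report[this_year] = (projected, balance, 1 - abs(balance) / projected)
    return report


report = balance_report(years)
for year, (projected, balance, indicator) in report.items():
    print(year, projected, balance, f"{indicator:.1%}")
```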

c. Partially Dependent Attributes

The values of redundant and derived attributes are prescribed exactly by the
dependency. Oftentimes, the relationships between attributes are not so exact. The
value of one attribute may restrict possible values of another attribute to a smaller
subset, but not to a single value.

Example : Certain causes of death are only possible for women and/or men, e.g.
cancer of the cervix or causes related to maternal death.

Causes of death that ought to be associated with women:

Sex             n
FEM           120
MAL             1
Total         121
Indicator   99.2%

Indicator = 1 - MaleDeaths / DeathsFemaleCauses

d. Conditional Optionality

Conditional optionality represents situations where the value of one attribute determines
whether another attribute must be Null or not Null (i.e., whether a value is prohibited
or required). Technically speaking, attributes with conditional optionality
are a special case of partially dependent attributes discussed above.

e. Correlated Attributes

Values of one attribute can change the likelihood of values of another one, though not
firmly restricting any possibilities. An example is the correlation between gender and
first name. The majority of names are distinctly male or female. Thus there is a definite
relationship between these attributes; however, the relationship is not exact in nature.

2.7 Total Data Quality Management : Theory
2.7.1 Learning Objectives
1. Able to identify the role players in data quality and their respective roles
2. Able to describe the basic principles of Total Data Quality Management
3. Able to list and describe the steps in the Ten Step Approach to Data
Quality Improvement

2.7.2 Content
1. Role Players
a. Data Collectors
b. Data Custodians
c. Data Consumers
2. Total Data Quality Management Cycle
a. Define
b. Measure
c. Analyse
d. Improve

2.7.3 Pre-reading and Reference Material


1. Carlo Batini, Monica Scannapieco. Data Quality: Concepts, Methodologies
and Techniques. 2006. Springer, Berlin. Pp 161-188.
2. Danette McGilvray. Executing Data Quality Projects: Ten Steps to Quality
Data and Trusted Information. 2008. Morgan Kaufmann, Burlington. Pp 54-58.

Appendix A : Sample Database

Tables and attributes of the sample database, as shown in the data model diagram:

Locations            Location, Latitude, Longitude, StartDate, EndDate
Individuals          Individual, LastName, FirstName, Sex, DoB, EndDate, MotherID, FatherID
ResidentEpisodes     ResidentEpisode, Individual, Location, StartDate, StartPrecision,
                     InitiatingEventType, FirstObservation, EndDate, EndPrecision,
                     TerminatingEventType, LastObservation
Births               ResidentEpisode, Pregnancy, Birthweight
Deaths               ResidentEpisode, DeathCause, DeathLocation
InMigrations         ResidentEpisode, OriginLocation, OriginPlace, Reason
OutMigrations        ResidentEpisode, DestinationLocation, DestinationPlace, Reason
Pregnancies          Pregnancy, Individual, StartDate, FirstObservation, EndDate,
                     TerminatingEventType, LastObservation, StillBorn, LiveBorn,
                     BirthAttendant, BirthLocation
CensusRounds         CensusRound, StartDate, EndDate
Observations         Observation, Location, CensusRound, ObservationDate, Observer,
                     ObservationType
StatusObservations   StatusObservationID, Individual, Observation, MaritalStatus,
                     EducationLevel
Appendix B : SQL Scripts
--
--region Attribute Domain Constraints
--
--region Optionality Constraints
--
--region Cause of Death example
--
SELECT
DeathCause,
COUNT(*) n
FROM dbo.Deaths D
GROUP BY DeathCause
ORDER BY DeathCause --COUNT(*) DESC
--
SELECT
DeathCause,
MAX(C.Description) Description,
COUNT(*) n
FROM dbo.Deaths D
LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY DeathCause
ORDER BY DeathCause
--
-- Final formulation
--
SELECT
CASE
WHEN DeathCause IS NULL THEN 'Null'
WHEN DeathCause<'A' THEN 'Unassigned'
ELSE 'Assigned'
END Cause,
COUNT(*) n
FROM dbo.Deaths D
LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY CASE
WHEN DeathCause IS NULL THEN 'Null'
WHEN DeathCause<'A' THEN 'Unassigned'
ELSE 'Assigned'
END
--
-- Data Quality Trend
--
SELECT
YEAR(E.EndDate) Year,
CASE
WHEN DeathCause IS NULL THEN 'Null'
WHEN DeathCause<'A' THEN 'Unassigned'
ELSE 'Assigned'
END Cause,
COUNT(*) n
FROM dbo.Deaths D
JOIN dbo.ResidentEpisodes E ON D.ResidentEpisode=E.ResidentEpisode
LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY YEAR(E.EndDate),
CASE
WHEN DeathCause IS NULL THEN 'Null'
WHEN DeathCause<'A' THEN 'Unassigned'
ELSE 'Assigned'
END
ORDER BY YEAR(E.EndDate),Cause
--endregion
--region Complex example - Internal migration destination
-- Destination location for internal migrations
SELECT
DestinationLocation, COUNT(*) n
FROM dbo.OutMigrations OM

JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisode
WHERE TerminatingEventType='INT'
GROUP BY DestinationLocation
ORDER BY COUNT(*) DESC
--
-- Grouped Destination
--
SELECT
CASE
WHEN DestinationLocation=999998 THEN 'Unknown'
WHEN DestinationLocation IS NULL THEN 'Null'
ELSE 'Known'
END Destination,
COUNT(*) n
FROM dbo.OutMigrations OM
JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisode
WHERE TerminatingEventType='INT'
GROUP BY CASE
WHEN DestinationLocation=999998 THEN 'Unknown'
WHEN DestinationLocation IS NULL THEN 'Null'
ELSE 'Known'
END
--
-- Further Investigation
--
SELECT
CASE
WHEN DestinationLocation=999998 THEN 'Unknown'
WHEN DestinationLocation IS NULL THEN 'Null'
WHEN L.Location IS NULL THEN 'Location wrong'
ELSE 'Known'
END Destination,
COUNT(*) n
FROM dbo.OutMigrations OM
JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisode
LEFT JOIN dbo.Locations L ON OM.DestinationLocation=L.Location
WHERE TerminatingEventType='INT'
GROUP BY CASE
WHEN DestinationLocation=999998 THEN 'Unknown'
WHEN DestinationLocation IS NULL THEN 'Null'
WHEN L.Location IS NULL THEN 'Location wrong'
ELSE 'Known'
END
--endregion
--endregion
--region Format Constraints
SELECT
COUNT(*) Total,
SUM(CASE WHEN PATINDEX('%[^a-zA-Z '']%',LastName)>0 THEN 1 ELSE 0 END) Invalid
FROM dbo.Individuals
--endregion
--region Valid Value Constraints
SELECT
InitiatingEventType,
COUNT(*) n
FROM dbo.ResidentEpisodes
GROUP BY InitiatingEventType
--
SELECT
YEAR(StartDate) Yr,
CASE
WHEN InitiatingEventType='HMS' THEN 'Invalid'
ELSE 'Valid'
END Validity,
COUNT(*) n
FROM dbo.ResidentEpisodes
GROUP BY YEAR(StartDate),
CASE

WHEN InitiatingEventType='HMS' THEN 'Invalid'
ELSE 'Valid'
END
ORDER BY Yr,Validity
--
-- Birth Weight
--
SELECT
Birthweight/100 W100q,
COUNT(*) n
FROM dbo.Births B
JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
WHERE StartDate BETWEEN '20000101' AND '20101231'
GROUP BY Birthweight/100
ORDER BY Birthweight/100
--endregion
--region Precision and Granularity Constraints
--region Date of Birth
-- Birth Date
SELECT
StartPrecision,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DLV'
GROUP BY StartPrecision
ORDER BY StartPrecision
--endregion
--region Complex example Migration Date Precision
--
-- InMigration
SELECT
StartPrecision,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INM'
GROUP BY StartPrecision
ORDER BY StartPrecision
--
-- Internal InMigration
SELECT
StartPrecision,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INT'
GROUP BY StartPrecision
ORDER BY StartPrecision
--
-- OutMigration
SELECT
EndPrecision,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='OTM'
GROUP BY EndPrecision
ORDER BY EndPrecision
--
-- Internal OutMigration
SELECT
EndPrecision,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='INT'
GROUP BY EndPrecision
ORDER BY EndPrecision
--
-- Migration Precision by Time
--
WITH InPrecision AS (

SELECT
YEAR(StartDate) Yr,
StartPrecision [Precision],
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType IN ('INM','INT')
GROUP BY YEAR(StartDate),StartPrecision
),
OutPrecision AS (
SELECT
YEAR(EndDate) Yr,
EndPrecision [Precision],
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType IN ('INT','OTM')
GROUP BY YEAR(EndDate),EndPrecision
),
InScore AS (
SELECT
Yr,
SUM((10-[Precision])*n) Score,
SUM(9*n) MaxScore
FROM InPrecision
GROUP BY Yr
),
OutScore AS (
SELECT
Yr,
SUM((10-[Precision])*n) Score,
SUM(9*n) MaxScore
FROM OutPrecision
GROUP BY Yr
)
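-- FULL JOIN keeps years that have only in-migrations or only out-migrations
SELECT
COALESCE(I.Yr,O.Yr) Yr,
SUM(ISNULL(I.Score,0)+ISNULL(O.Score,0)) Score,
SUM(ISNULL(I.MaxScore,0)+ISNULL(O.MaxScore,0)) MaxScore
FROM InScore I
FULL JOIN OutScore O ON (I.Yr=O.Yr)
GROUP BY COALESCE(I.Yr,O.Yr)
ORDER BY Yr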
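--
-- Illustrative sketch (not part of the original workshop script): the score
-- arithmetic used above, applied to made-up counts for a single year. It
-- assumes, as the formula implies, that a precision code of 1 is the most
-- precise and earns the full 9 points per event.
-- Here 80 events with code 1 and 20 events with code 5 give
-- Score = 80*9 + 20*5 = 820 out of MaxScore = 100*9 = 900, about 91%.
--
;WITH SamplePrecision AS (
SELECT 2008 Yr, 1 [Precision], 80 n
UNION ALL SELECT 2008, 5, 20
)
SELECT
Yr,
SUM((10-[Precision])*n) Score,
SUM(9*n) MaxScore,
100.0*SUM((10-[Precision])*n)/SUM(9*n) PctOfMax
FROM SamplePrecision
GROUP BY Yr;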
--endregion
--endregion
--endregion
--
--region Relational Integrity Constraints
--region Identity Rules
-- Duplicate Individuals
SELECT
*
INTO IndividualComparison
FROM dbo.udfSeekDuplicates()
--
SELECT
Similarity,
COUNT(*) n
FROM dbo.IndividualComparison
GROUP BY Similarity
ORDER BY Similarity
--
SELECT
C.IndA,C.IndB,
I1.FirstName FirstNameA, I2.FirstName FirstNameB,
I1.LastName LastNameA, I2.LastName LastNameB,
I1.Sex SexA, I2.Sex SexB,
I1.DoB DoBA, I2.DoB DoBB
FROM dbo.IndividualComparison C
JOIN dbo.Individuals I1 ON (C.IndA=I1.Individual)

JOIN dbo.Individuals I2 ON (C.IndB=I2.Individual)
WHERE C.Similarity=0
ORDER BY C.IndA,C.IndB
--region AC Specific
SELECT
C.IndA,C.IndB,
I1.Name NameA, I2.Name NameB,
I1.Sex SexA, I2.Sex SexB,
I1.DoB DoBA, I2.DoB DoBB
FROM dbo.IndividualComparison C
JOIN ACDIS.dbo.vacNamedIndividuals I1 ON (C.IndA=I1.IIntID)
JOIN ACDIS.dbo.vacNamedIndividuals I2 ON (C.IndB=I2.IIntID)
WHERE C.Similarity=0
ORDER BY C.IndA,C.IndB
--endregion
SELECT
COUNT(*)
FROM dbo.Individuals
--endregion
--region Reference Rules
--
-- Child to Parent linkages
-- MotherId on Child
SELECT
CASE
WHEN C.MotherID IS NULL THEN 'Unknown'
WHEN M.Individual IS NULL THEN 'Missing'
ELSE 'Known'
END Mother,
COUNT(*) n
FROM dbo.Individuals C
LEFT JOIN dbo.Individuals M ON (C.MotherID=M.Individual)
GROUP BY CASE
WHEN C.MotherID IS NULL THEN 'Unknown'
WHEN M.Individual IS NULL THEN 'Missing'
ELSE 'Known'
END
-- FatherId on Child
SELECT
CASE
WHEN C.FatherID IS NULL THEN 'Unknown'
WHEN F.Individual IS NULL THEN 'Missing'
ELSE 'Known'
END Father,
COUNT(*) n
FROM dbo.Individuals C
LEFT JOIN dbo.Individuals F ON (C.FatherID=F.Individual)
GROUP BY CASE
WHEN C.FatherID IS NULL THEN 'Unknown'
WHEN F.Individual IS NULL THEN 'Missing'
ELSE 'Known'
END
--endregion
--region Cardinal Rules
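-- Incorrect formulation: the LEFT JOIN returns one row per residency episode,
-- so individuals with several episodes are counted more than once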
SELECT
CASE
WHEN R.Individual IS NULL THEN 'None'
ELSE 'Exists'
END Residency,
COUNT(*) n
FROM dbo.Individuals I
LEFT JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
GROUP BY CASE
WHEN R.Individual IS NULL THEN 'None'
ELSE 'Exists'
END
--region Correct formulation

WITH UniqueResidencies AS (
SELECT DISTINCT
Individual
FROM dbo.ResidentEpisodes
)
SELECT
CASE
WHEN R.Individual IS NULL THEN 'None'
ELSE 'Exists'
END Residency,
COUNT(*) n
FROM dbo.Individuals I
LEFT JOIN UniqueResidencies R ON (I.Individual=R.Individual)
GROUP BY CASE
WHEN R.Individual IS NULL THEN 'None'
ELSE 'Exists'
END
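--
-- Illustrative sketch (not part of the original workshop script): why the
-- two formulations differ, using made-up data. Individual 1 has two
-- residency episodes and individual 2 has none.
--
;WITH SampleIndividuals AS (
SELECT 1 Individual
UNION ALL SELECT 2
),
SampleEpisodes AS (
SELECT 1 Individual
UNION ALL SELECT 1
)
SELECT
(SELECT COUNT(*)
FROM SampleIndividuals I
JOIN SampleEpisodes R ON (I.Individual=R.Individual)) RowsViaEpisodes, -- 2: one row per episode
(SELECT COUNT(*)
FROM SampleIndividuals I
JOIN (SELECT DISTINCT Individual FROM SampleEpisodes) R
ON (I.Individual=R.Individual)) IndividualsWithResidency; -- 1: one row per individual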
--
-- Residency Cardinality
--
WITH ResidencyCount AS (
SELECT
I.Individual,
COUNT(*) n
FROM dbo.Individuals I
JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
GROUP BY I.Individual
UNION
SELECT
I.Individual,
0 n
FROM dbo.Individuals I
LEFT JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
WHERE R.Individual IS NULL
)
SELECT
n ResidencyCardinality,
COUNT(*) Cnt
FROM ResidencyCount
GROUP BY n
ORDER BY n
--endregion
--endregion
--endregion
--
--region Rules for Historical Data
--region Currency Rule
--
-- Last visit of current residency episodes
SELECT
CensusRound,
MIN(ObservationDate) MinDate,
MAX(ObservationDate) MaxDate
FROM dbo.Observations
GROUP BY CensusRound
ORDER BY CensusRound
--
-- Start of previous round 13 Jul 2009
SELECT
CASE
WHEN EndDate>'20090712' THEN 'Current'
ELSE 'Not Current'
END Currency,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='VIS'
GROUP BY CASE
WHEN EndDate>'20090712' THEN 'Current'

ELSE 'Not Current'
END
--
-- Currency of status observations, e.g. MaritalStatus
--
WITH YearEnds AS (
SELECT CAST('20001231' AS datetime) YearEnd
UNION
SELECT CAST('20011231' AS datetime) YearEnd
UNION
SELECT CAST('20021231' AS datetime) YearEnd
UNION
SELECT CAST('20031231' AS datetime) YearEnd
UNION
SELECT CAST('20041231' AS datetime) YearEnd
UNION
SELECT CAST('20051231' AS datetime) YearEnd
UNION
SELECT CAST('20061231' AS datetime) YearEnd
UNION
SELECT CAST('20071231' AS datetime) YearEnd
UNION
SELECT CAST('20081231' AS datetime) YearEnd
UNION
SELECT CAST('20091231' AS datetime) YearEnd
),
YearEndIndividuals AS (
SELECT DISTINCT
Individual,YearEnd
FROM dbo.ResidentEpisodes R
CROSS JOIN YearEnds
WHERE R.EndDate>=YearEnd
AND R.StartDate<YearEnd
),
SOCurrency AS (
SELECT
S.Individual,YearEnd,
MIN(DateDiff(day,O.ObservationDate,YearEnd)) Currency
FROM dbo.StatusObservations S
JOIN dbo.Observations O ON (S.Observation=O.Observation)
JOIN YearEndIndividuals I ON (S.Individual=I.Individual)
AND (O.ObservationDate<=I.YearEnd)
GROUP BY S.Individual,YearEnd
)
SELECT
I.YearEnd,
CASE
WHEN C.Currency IS NULL THEN 'Undefined'
WHEN C.Currency>183 THEN 'NotCurrent'
ELSE 'Current'
END Currency,
COUNT(*) n
FROM YearEndIndividuals I
LEFT JOIN SOCurrency C
ON (I.Individual=C.Individual) AND (I.YearEnd=C.YearEnd)
GROUP BY I.YearEnd,CASE
WHEN C.Currency IS NULL THEN 'Undefined'
WHEN C.Currency>183 THEN 'NotCurrent'
ELSE 'Current'
END
ORDER BY I.YearEnd,CASE
WHEN C.Currency IS NULL THEN 'Undefined'
WHEN C.Currency>183 THEN 'NotCurrent'
ELSE 'Current'
END
--endregion
--
--

--region Granularity Rule
WITH NumberedVisits AS (
SELECT
Location, CensusRound, ObservationDate,
ROW_NUMBER()
OVER(PARTITION BY Location, CensusRound
ORDER BY ObservationDate) RowNum,
COUNT(*) OVER(PARTITION BY Location, CensusRound) AS Cnt
FROM dbo.Observations
WHERE CensusRound BETWEEN 1 AND 21
AND ObservationDate BETWEEN '20000101' AND '20091231'
),
MidVisits AS (
SELECT
Location, CensusRound,
CAST(ObservationDate AS float) fDate,
RowNum, Cnt
FROM NumberedVisits
WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
SELECT
Location, CensusRound,
AVG(fDate) mDate
FROM MidVisits
GROUP BY Location, CensusRound
),
MedianVisits AS (
SELECT
Location, CensusRound, CONVERT(datetime,mDate) MedianDate
FROM MedianVisitDate
),
VisitGaps AS (
SELECT
R1.Location,
R1.CensusRound Rn,
R2.CensusRound Rnn,
DATEDIFF(day,R1.MedianDate,R2.MedianDate) Granularity
FROM MedianVisits R1
JOIN MedianVisits R2
ON (R1.Location=R2.Location) AND (R1.CensusRound=R2.CensusRound-1)
)
SELECT
Rn,Rnn,Granularity,COUNT(*) n
FROM VisitGaps
GROUP BY Rn,Rnn,Granularity
ORDER BY Rn,Rnn,Granularity
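--
-- Illustrative sketch (not part of the original workshop script): how the
-- RowNum filter in MidVisits picks the median. With integer division,
-- (Cnt+1)/2 and (Cnt+2)/2 are the same row when Cnt is odd and the two
-- middle rows when Cnt is even. The four made-up values below have
-- middle values 3 and 4, so the query returns 3.5.
--
;WITH SampleValues AS (
SELECT 1 v
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 10
),
Numbered AS (
SELECT
v,
ROW_NUMBER() OVER(ORDER BY v) RowNum,
COUNT(*) OVER() Cnt
FROM SampleValues
)
SELECT
AVG(CAST(v AS float)) Median
FROM Numbered
WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2);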
--
-- Quality indicator based on granularity
-- Gap between rounds should be within 15 days of 183 days (twice-yearly rounds);
-- rounds 4, 6, 7 and 8 are checked against 122 days instead
--
SELECT
Rnn CensusRound,
CASE
WHEN Rnn IN (4,6,7,8) AND Granularity BETWEEN 107 AND 137 THEN 'InRange'
WHEN Granularity BETWEEN 168 AND 198 THEN 'InRange'
ELSE 'Outside'
END Indicator,
COUNT(*) n
FROM dbo.vLocationVisitGaps
GROUP BY Rnn,
CASE
WHEN Rnn IN (4,6,7,8) AND Granularity BETWEEN 107 AND 137 THEN 'InRange'
WHEN Granularity BETWEEN 168 AND 198 THEN 'InRange'
ELSE 'Outside'
END
ORDER BY Rnn, Indicator
--endregion

--region Continuity Rule
WITH NumberedEpisodes AS (
SELECT
Individual,
StartDate,
InitiatingEventType,
EndDate,
TerminatingEventType,
ROW_NUMBER() OVER(PARTITION BY Individual ORDER BY StartDate) RowNum
FROM dbo.ResidentEpisodes
)
SELECT
YEAR(E2.StartDate) Yr,
CASE
WHEN E2.InitiatingEventType<>'INT' THEN 'InvalidNext'
WHEN ABS(DATEDIFF(day,E1.EndDate,E2.StartDate))>1 THEN 'Discontinuity'
ELSE 'Continuity'
END Continuity,
COUNT(*) n
FROM NumberedEpisodes E1
JOIN NumberedEpisodes E2
ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
WHERE E1.TerminatingEventType='INT'
GROUP BY YEAR(E2.StartDate),
CASE
WHEN E2.InitiatingEventType<>'INT' THEN 'InvalidNext'
WHEN ABS(DATEDIFF(day,E1.EndDate,E2.StartDate))>1 THEN 'Discontinuity'
ELSE 'Continuity'
END
ORDER BY Yr, Continuity
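--
-- Illustrative sketch (not part of the original workshop script): the
-- ROW_NUMBER self-join used above to pair each episode with the next one
-- for the same individual, shown on made-up dates. The first pair is
-- one day apart (continuous under the rule above); the second has a gap
-- of 31 days.
--
;WITH SampleEpisodes AS (
SELECT 1 Individual,
CAST('20050101' AS datetime) StartDate,
CAST('20050630' AS datetime) EndDate
UNION ALL SELECT 1, '20050701', '20051231'
UNION ALL SELECT 1, '20060131', '20061231'
),
Numbered AS (
SELECT
Individual,StartDate,EndDate,
ROW_NUMBER() OVER(PARTITION BY Individual ORDER BY StartDate) RowNum
FROM SampleEpisodes
)
SELECT
E1.EndDate PreviousEnd,
E2.StartDate NextStart,
DATEDIFF(day,E1.EndDate,E2.StartDate) GapDays
FROM Numbered E1
JOIN Numbered E2
ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1);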
--endregion
--region Timestamp pattern rule
WITH NumberedVisits AS (
SELECT
Location, CensusRound, ObservationDate,
ROW_NUMBER()
OVER(PARTITION BY Location, CensusRound
ORDER BY ObservationDate) RowNum,
COUNT(*) OVER(PARTITION BY Location, CensusRound) AS Cnt
FROM dbo.Observations
WHERE CensusRound BETWEEN 1 AND 21
AND ObservationDate BETWEEN '20000101' AND '20091231'
),
MidVisits AS (
SELECT
Location, CensusRound,
CAST(ObservationDate AS float) fDate,
RowNum, Cnt
FROM NumberedVisits
WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
SELECT
Location, CensusRound,
AVG(fDate) mDate
FROM MidVisits
GROUP BY Location, CensusRound
),
MedianVisits AS (
SELECT
Location, CensusRound, CONVERT(datetime,mDate) MedianDate
FROM MedianVisitDate
),
Semesters AS (
SELECT
1 AS Semester,
CAST('20000101' AS datetime) SemStart,
DATEADD(day,-1,DATEADD(quarter,2,'20000101')) SemEnd

UNION ALL
SELECT
Semester+1 Semester,
DATEADD(day,1,SemEnd) SemStart,
DATEADD(day,-1,DATEADD(quarter,2,DATEADD(day,1,SemEnd))) SemEnd
FROM Semesters
WHERE SemStart<'20090701'
),
SemesterVisits AS (
SELECT
Location,Semester,COUNT(*) n
FROM MedianVisits V
JOIN Semesters ON (MedianDate>=SemStart) AND (MedianDate<=SemEnd)
GROUP BY Location,Semester
)
SELECT
*
FROM SemesterVisits
ORDER BY Location,Semester
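--
-- Illustrative sketch (not part of the original workshop script): the
-- recursive Semesters CTE on its own, so the generated half-year windows
-- can be inspected. With the stop condition used above it should list
-- 20 semesters, from 1 Jan 2000 to 31 Dec 2009.
--
;WITH Semesters AS (
SELECT
1 AS Semester,
CAST('20000101' AS datetime) SemStart,
DATEADD(day,-1,DATEADD(quarter,2,'20000101')) SemEnd
UNION ALL
SELECT
Semester+1 Semester,
DATEADD(day,1,SemEnd) SemStart,
DATEADD(day,-1,DATEADD(quarter,2,DATEADD(day,1,SemEnd))) SemEnd
FROM Semesters
WHERE SemStart<'20090701'
)
SELECT
Semester,SemStart,SemEnd
FROM Semesters
ORDER BY Semester;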
--endregion
--endregion
--region Value Pattern Rule
--region Direction of Change
--
-- Example : Educational Attainment
--
WITH EducationStatus AS (
SELECT
Individual,ObservationDate,Years,
ROW_NUMBER()
OVER(PARTITION BY Individual ORDER BY ObservationDate) RowNum
FROM dbo.StatusObservations SO
JOIN dbo.Observations O ON (SO.Observation=O.Observation)
JOIN dbo.EducationLevels E ON (SO.EducationLevel=E.EducationLevel)
WHERE NOT E.Years IS NULL
)
SELECT
CASE
WHEN E2.Years>=E1.Years THEN 'Valid'
ELSE 'Invalid'
END Direction,
COUNT(*) Measures
FROM EducationStatus E1
JOIN EducationStatus E2
ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
GROUP BY CASE
WHEN E2.Years>=E1.Years THEN 'Valid'
ELSE 'Invalid'
END
--
--endregion
--region Magnitude of Change
WITH EducationStatus AS (
SELECT
Individual,ObservationDate,Years,
ROW_NUMBER()
OVER(PARTITION BY Individual ORDER BY ObservationDate) RowNum
FROM dbo.StatusObservations SO
JOIN dbo.Observations O ON (SO.Observation=O.Observation)
JOIN dbo.EducationLevels E ON (SO.EducationLevel=E.EducationLevel)
WHERE NOT E.Years IS NULL
)
SELECT
CASE
WHEN E2.Years<E1.Years THEN 'Invalid Direction'
WHEN (E2.Years-E1.Years)>DATEDIFF(year,E1.ObservationDate,E2.ObservationDate)
THEN 'Invalid Magnitude'
ELSE 'Valid'

END Direction,
COUNT(*) Measures
FROM EducationStatus E1
JOIN EducationStatus E2
ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
GROUP BY CASE
WHEN E2.Years<E1.Years THEN 'Invalid Direction'
WHEN (E2.Years-E1.Years)>DATEDIFF(year,E1.ObservationDate,E2.ObservationDate)
THEN 'Invalid Magnitude'
ELSE 'Valid'
END
--endregion
--endregion
--region Event History Rule
--region Event Dependencies
--
-- An out-migration must not be immediately preceded by a death, visit or out-migration event
--
WITH Events AS (
SELECT
Individual,
InitiatingEventType Event,
StartDate EventDate,
ResidentEpisode
FROM dbo.ResidentEpisodes
WHERE StartDate<>Enddate
UNION ALL
SELECT
Individual,
TerminatingEventType Event,
EndDate EventDate,
ResidentEpisode
FROM dbo.ResidentEpisodes
WHERE StartDate<>Enddate
),
NumberedEvents AS (
SELECT
Individual,
Event,EventDate,
ROW_NUMBER()
OVER(PARTITION BY Individual ORDER BY EventDate, ResidentEpisode) RowNum
FROM Events
)
SELECT
CASE
WHEN E1.Event IN ('DTH','OTM','VIS') THEN 'Incorrect'
ELSE 'Correct'
END Dependency,
--E1.Event,
COUNT(*) n
FROM NumberedEvents E1
JOIN NumberedEvents E2
ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
WHERE E2.Event='OTM'
GROUP BY --E1.Event
CASE
WHEN E1.Event IN ('DTH','OTM','VIS') THEN 'Incorrect'
ELSE 'Correct'
END
--endregion
--region Event Conditions
--
-- Consecutive live-birth deliveries to the same woman should be at least 9 months (280 days) apart
--
WITH NumberedPregnancies AS (
SELECT
Individual,EndDate DeliveryDate,
ROW_NUMBER()

OVER(PARTITION BY Individual ORDER BY EndDate) RowNum
FROM dbo.Pregnancies
WHERE LiveBorn>0
)
SELECT
CASE
WHEN DATEDIFF(day,P1.DeliveryDate,P2.DeliveryDate)<280 THEN 'TooShort'
ELSE 'Valid'
END BirthSpacing,
COUNT(*) Pregnancies
FROM NumberedPregnancies P1
JOIN NumberedPregnancies P2
ON (P1.Individual=P2.Individual) AND (P1.RowNum=P2.RowNum-1)
GROUP BY CASE
WHEN DATEDIFF(day,P1.DeliveryDate,P2.DeliveryDate)<280 THEN 'TooShort'
ELSE 'Valid'
END
--endregion
--region Event-specific attribute constraints
--
-- Age of the woman at the end of the pregnancy should be 15 to 49 years
--
SELECT
CASE
WHEN dbo.fnacAgeYears(I.DoB,P.EndDate) BETWEEN 15 AND 49 THEN 'Valid'
ELSE 'Invalid'
END MaternalAge,
COUNT(*) Pregnancies
FROM dbo.Pregnancies P
JOIN dbo.Individuals I ON (P.Individual=I.Individual)
GROUP BY CASE
WHEN dbo.fnacAgeYears(I.DoB,P.EndDate) BETWEEN 15 AND 49 THEN 'Valid'
ELSE 'Invalid'
END
--endregion
--endregion
--endregion
--
--region Rules for state-dependent objects
--region State domain constraint
--endregion
--region Action domain constraint
--endregion
--region Terminator domain constraint
SELECT
ToState,Action,COUNT(*) n
FROM dbo.udfStateTransitions('20000101')
WHERE Transition=1
GROUP BY ToState,Action
ORDER BY ToState,Action
--endregion
--region State-transition constraints
--
-- Individuals with invalid end states
--
WITH LastTransition AS (
SELECT
Individual,MAX(Transition) LastTransition
FROM dbo.udfStateTransitions('20000101')
GROUP BY Individual
)
SELECT
CASE
WHEN ToState='INV' THEN 'Invalid'
ELSE 'Valid'
END Quality,
COUNT(*) Individuals
FROM dbo.udfStateTransitions('20000101') T
JOIN LastTransition LT
ON (T.Individual=LT.Individual) AND (T.Transition=LT.LastTransition)
GROUP BY

CASE
WHEN ToState='INV' THEN 'Invalid'
ELSE 'Valid'
END
--
-- Invalid transitions
--
SELECT
CASE
WHEN ToState='INV' THEN 'Invalid'
ELSE 'Valid'
END Quality,
COUNT(*) Transitions
FROM dbo.udfStateTransitions('20000101') T
GROUP BY
CASE
WHEN ToState='INV' THEN 'Invalid'
ELSE 'Valid'
END
--
-- Breakdown of invalid transitions
--
SELECT
InvalidReason,Action,
COUNT(*) n
FROM dbo.udfStateTransitions('20000101') T
WHERE ToState='INV'
GROUP BY InvalidReason, Action
ORDER BY InvalidReason, Action
--
-- Breakdown by surveillance round
--
SELECT
O.CensusRound,
SUM(CASE WHEN ToState='INV' THEN 1 ELSE 0 END) Invalid,
SUM(CASE WHEN ToState='INV' THEN 0 ELSE 1 END) Valid,
COUNT(*) Transitions
FROM dbo.udfStateTransitions('20000101') T
JOIN dbo.Observations O ON (T.Observation=O.Observation)
GROUP BY O.CensusRound
ORDER BY O.CensusRound
--endregion
--region State-action constraints
--endregion
--region Continuity rules
--endregion
--region Duration rules
--
-- Residency episode cannot be of zero or negative duration
--
SELECT
YEAR(StartDate) Yr,
CASE
WHEN DATEDIFF(day,StartDate,EndDate)>0 THEN 'Valid'
ELSE 'Invalid'
END Duration,
COUNT(*) Episodes
FROM dbo.ResidentEpisodes
GROUP BY YEAR(StartDate),
CASE
WHEN DATEDIFF(day,StartDate,EndDate)>0 THEN 'Valid'
ELSE 'Invalid'
END
ORDER BY Yr,Duration
--endregion
--region Action pre-conditions
WITH ResidentBabies AS (
SELECT

Individual Baby,StartDate
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DLV'
),
ResidentBabyMothers AS (
SELECT DISTINCT
B.Baby,
I.MotherID Mother
FROM dbo.Individuals I
JOIN ResidentBabies B ON (I.Individual=B.Baby)
WHERE NOT MotherID IS NULL
)
SELECT
CASE
WHEN BM.Baby IS NULL THEN 'Mother unknown'
WHEN RE.ResidentEpisode IS NULL THEN 'Mother non-resident'
ELSE 'Mother resident'
END Mothers,
COUNT(*) Babies
FROM ResidentBabies B
LEFT JOIN ResidentBabyMothers BM ON (B.Baby=BM.Baby)
LEFT JOIN dbo.ResidentEpisodes RE
ON (BM.Mother=RE.Individual)
AND (RE.StartDate<=B.StartDate) AND (RE.EndDate>=B.StartDate)
GROUP BY
CASE
WHEN BM.Baby IS NULL THEN 'Mother unknown'
WHEN RE.ResidentEpisode IS NULL THEN 'Mother non-resident'
ELSE 'Mother resident'
END
--endregion
--region Action post-conditions
--endregion
--endregion
--
--region General attribute dependency rules
--region Redundant attributes
--
-- Are all cases where residency is started by birth
-- which is linked to a pregnancy and then to the mother,
-- also reflected in the MotherID link of the child?
--
WITH DirectMCLink AS ( --76898 pairs
SELECT
MotherID,
Individual ChildID
FROM dbo.Individuals
WHERE NOT MotherID IS NULL
),
IndirectMCLink AS ( --15747
SELECT DISTINCT
P.Individual MotherID,
R.Individual ChildID
FROM dbo.Pregnancies P
JOIN dbo.Births B ON (P.Pregnancy=B.Pregnancy)
JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
)
SELECT
CASE
WHEN D.MotherID IS NULL THEN 'Not linked'
ELSE 'Linked'
END Link,
COUNT(*) Pairs
FROM IndirectMCLink I

LEFT JOIN DirectMCLink D
ON (I.MotherID=D.MotherID) AND (I.ChildID=D.ChildID)
GROUP BY
CASE
WHEN D.MotherID IS NULL THEN 'Not linked'
ELSE 'Linked'
END
--
-- Of the children born to the mother while she was resident,
-- are all such children recorded as resident
-- and the residency start marked as Birth?
WITH MotherBirths AS ( --21907
SELECT
MotherID,
Individual ChildID,
DoB
FROM dbo.Individuals
WHERE NOT MotherID IS NULL
AND DoB>='20000101' -- After start of DSS
),
BirthsDuringResidency AS ( --15798
SELECT
B.*
FROM MotherBirths B
JOIN dbo.ResidentEpisodes R
ON (R.Individual=B.MotherID)
AND (B.DoB>=R.StartDate)
AND (B.DoB<=R.EndDate)
),
ResidenciesFromBirth AS ( --16430
SELECT
Individual ChildID
FROM dbo.Births B
JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
)
SELECT
CASE
WHEN A.ChildID IS NULL THEN 'Birth not linked'
WHEN B.ChildID IS NULL THEN 'Birth not resident'
ELSE 'Consistent'
END Link,
COUNT(*) Pairs
FROM BirthsDuringResidency A
FULL JOIN ResidenciesFromBirth B ON (A.ChildID=B.ChildID)
GROUP BY
CASE
WHEN A.ChildID IS NULL THEN 'Birth not linked'
WHEN B.ChildID IS NULL THEN 'Birth not resident'
ELSE 'Consistent'
END
ORDER BY Link
--endregion
--region Derived Attributes
--
-- Data should satisfy the demographic equation
--
-- Resident Population at start of year
SELECT
'Population' AS Component,
SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0
END) Y2000,

SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0
END) Y2001,
SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0
END) Y2002,
SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0
END) Y2003,
SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0
END) Y2004,
SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0
END) Y2005,
SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0
END) Y2006,
SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0
END) Y2007,
SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0
END) Y2008,
SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0
END) Y2009
FROM dbo.ResidentEpisodes
UNION
SELECT
'Start observation' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT
'Births' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT
'Immigration' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,

SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT
'Deaths' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT
'Emigration' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS'
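--
-- Illustrative sketch (not part of the original workshop script): checking
-- the components against the balancing equation
--   Population(next year) = Population + Start observation + Births
--                           + Immigration - Deaths - Emigration
-- The figures are made up, and the equation is the usual demographic
-- balance, assumed here to be the check intended by the query above.
-- With these figures 1000 + 0 + 40 + 50 - 20 - 40 = 1030, which matches
-- the observed Y2001 population.
--
;WITH Components AS (
SELECT 'Population' Component, 1000 Y2000, 1030 Y2001
UNION ALL SELECT 'Start observation', 0, 0
UNION ALL SELECT 'Births', 40, 0
UNION ALL SELECT 'Immigration', 50, 0
UNION ALL SELECT 'Deaths', 20, 0
UNION ALL SELECT 'Emigration', 40, 0
)
SELECT
SUM(CASE WHEN Component IN ('Deaths','Emigration') THEN -Y2000 ELSE Y2000 END) ExpectedY2001,
MAX(CASE WHEN Component='Population' THEN Y2001 END) ObservedY2001
FROM Components;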
--
-- Taking into account contextual factors, such as change in DSS boundary
--
WITH CensoredEpisodes AS (
SELECT
CASE
WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN 'DSS'
ELSE R.InitiatingEventType
END InitiatingEventType,
CASE
WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN '20061001'
ELSE R.StartDate
END StartDate,
R.EndDate,
R.TerminatingEventType
FROM dbo.ResidentEpisodes R
JOIN dbo.Locations L ON (R.Location=L.Location)
)
SELECT
'Population' AS Component,
SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0
END) Y2000,
SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0
END) Y2001,
SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0
END) Y2002,
SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0
END) Y2003,

SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0
END) Y2004,
SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0
END) Y2005,
SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0
END) Y2006,
SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0
END) Y2007,
SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0
END) Y2008,
SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0
END) Y2009
FROM CensoredEpisodes
UNION
SELECT
'Start observation' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT
'Births' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT
'Immigration' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT
'Deaths' AS Component,

SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT
'Emigration' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS'
--
-- Taking into account contextual factors and loss to follow-up
--
WITH CensoredEpisodes AS (
SELECT
CASE
WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN 'DSS'
ELSE R.InitiatingEventType
END InitiatingEventType,
CASE
WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN '20061001'
ELSE R.StartDate
END StartDate,
R.EndDate,
R.TerminatingEventType
FROM dbo.ResidentEpisodes R
JOIN dbo.Locations L ON (R.Location=L.Location)
)
SELECT
'Population' AS Component,
SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0
END) Y2000,
SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0
END) Y2001,
SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0
END) Y2002,
SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0
END) Y2003,
SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0
END) Y2004,
SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0
END) Y2005,
SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0
END) Y2006,

SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0
END) Y2007,
SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0
END) Y2008,
SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0
END) Y2009
FROM CensoredEpisodes
UNION
SELECT
'Start observation' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT
'Births' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT
'Immigration' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT
'Deaths' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,

SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT
'Emigration' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS'
UNION
SELECT
'Loss to Follow-up' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='VIS'
--
-- Find 705 people present in 2001 in excess of expectations
--
WITH PresentIn2001 AS (
SELECT DISTINCT
Individual
FROM dbo.ResidentEpisodes
WHERE StartDate<'20010101' AND EndDate>='20010101'
),
CameIn2000 AS (
SELECT DISTINCT
Individual
FROM dbo.ResidentEpisodes
WHERE YEAR(StartDate)=2000
AND InitiatingEventType IN ('DSS','DLV','INM')
),
LeftIn2000 AS (
SELECT DISTINCT
Individual
FROM dbo.ResidentEpisodes
WHERE YEAR(EndDate)=2000
AND TerminatingEventType IN ('DTH','VIS','OTM')
)
SELECT
A.Individual

FROM PresentIn2001 A
JOIN LeftIn2000 B ON (A.Individual=B.Individual)
--
SELECT
*
FROM dbo.ResidentEpisodes
WHERE Individual=56179
--endregion
--region Partially Dependent Attributes
SELECT
D.DeathCause, C.Description,
COUNT(*) n
FROM dbo.Deaths D
JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY D.DeathCause,C.Description
ORDER BY n DESC
--
SELECT
I.Sex,
COUNT(*) n
FROM dbo.Individuals I
JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
JOIN dbo.Deaths D ON (R.ResidentEpisode=D.ResidentEpisode)
WHERE DeathCause IN
('C53','C50','C55','O72','O85','O15','O14','O75','C56','C57')
GROUP BY I.Sex
--
SELECT
I.Sex,C.Description
FROM dbo.Individuals I
JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
JOIN dbo.Deaths D ON (R.ResidentEpisode=D.ResidentEpisode)
JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
WHERE DeathCause IN
('C53','C50','C55','O72','O85','O15','O14','O75','C56','C57')
AND I.Sex='MAL'
--endregion
--endregion

Procedures, Views and User-defined Functions


CREATE FUNCTION dbo.udfStateTransitions(@DSSStart datetime)
RETURNS @Transitions TABLE (
[RecID] int IDENTITY (1, 1) NOT NULL PRIMARY KEY NONCLUSTERED,
Individual int NOT NULL,
Transition int NOT NULL,
FromState char(3) NOT NULL, --NUS (Not under surveillance)
--SLK (under surveillance location known)
--SLU (under surveillance location unknown)
--DTH (Death)
--INV (Invalid state)
ToState char(3) NOT NULL,
Action char(3) NOT NULL, --DSS (Surveillance Start),
--INM (Inmigration),
--DLV (Delivery),
--INT (Internal migration),
--DTH (Death),
--OTM (Outmigration),
--VIS (Visit),
--INV (Invalid action)
TransitionDate datetime NOT NULL,
Observation int NOT NULL,
InvalidReason varchar(80) NULL
)

AS BEGIN
DECLARE @Individual int
DECLARE @DoB datetime
DECLARE @InitiatingEventType char(3)
DECLARE @StartDate datetime
DECLARE @TerminatingEventType char(3)
DECLARE @EndDate datetime
DECLARE @Transition int
DECLARE @NextState char(3)
DECLARE @CurrentState char(3)
DECLARE @LastEvent char(3)
DECLARE @LastIndividual int
DECLARE @LastDate datetime
DECLARE @FirstObservation int
DECLARE @LastObservation int

DECLARE C CURSOR LOCAL FAST_FORWARD FOR


SELECT
I.Individual,DoB,
InitiatingEventType,StartDate,FirstObservation,
TerminatingEventType,R.EndDate,LastObservation
FROM dbo.Individuals I
JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
ORDER BY Individual,StartDate,ResidentEpisode;

OPEN C;

SET @LastIndividual=-1;

FETCH C INTO @Individual, @DoB,
@InitiatingEventType, @StartDate, @FirstObservation,
@TerminatingEventType, @EndDate, @LastObservation;
WHILE (@@FETCH_STATUS=0) BEGIN
IF (@LastIndividual<>@Individual) BEGIN --next individual
SET @CurrentState='NUS';
SET @LastDate=@DoB;
SET @Transition=0;
SET @LastIndividual=@Individual
END;
-- Do start event transition
SET @Transition = @Transition+1;
IF (@CurrentState='NUS') BEGIN
IF (@InitiatingEventType='DSS' AND @StartDate=@DSSStart AND @Transition=1 AND
@LastDate<=@StartDate) BEGIN
SET @NextState='SLK';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate, @FirstObservation);
END
ELSE IF (@InitiatingEventType='INM' AND @StartDate>@DSSStart AND
@LastDate<=@StartDate) BEGIN
SET @NextState='SLK';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate, @FirstObservation);
END
ELSE IF (@InitiatingEventType='DLV' AND @StartDate=@DoB AND @Transition=1 AND
@LastDate<=@StartDate) BEGIN
SET @NextState='SLK';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate, @FirstObservation);
END
ELSE IF (@InitiatingEventType IN ('INT','DTH','OTM','VIS')) BEGIN
SET @NextState='INV';

INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate,'Action disallowed if not under surveillance', @FirstObservation);
END
ELSE IF (@InitiatingEventType IN ('DSS','INM','DLV')) BEGIN --Invalid action condition
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate,'Action condition violated', @FirstObservation);
END
ELSE IF (@LastDate>@StartDate) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate,'Temporal integrity violated', @FirstObservation);
END
ELSE BEGIN --Invalid event
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate,'Invalid action', @FirstObservation);
END;
END;
IF (@CurrentState='SLK') BEGIN
IF (@InitiatingEventType IN ('VIS','OTM','DTH')) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate,'Action cannot start a residency', @FirstObservation);
END
ELSE IF (@InitiatingEventType IN ('INT','INM','DLV','DSS')) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate,'Action cannot start a residency if already at known location',
@FirstObservation);
END
ELSE BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate,'Invalid action', @FirstObservation);
END
END;
IF (@CurrentState='SLU') BEGIN
IF (@InitiatingEventType='INT' AND @LastDate<=@StartDate) BEGIN
SET @NextState='SLK';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate, @FirstObservation);
END
ELSE IF (@InitiatingEventType IN ('VIS','OTM','DTH')) BEGIN

SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate,'Action cannot start a residency', @FirstObservation);
END
ELSE IF (@InitiatingEventType IN ('INM','DLV','DSS')) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate,'Action cannot start a residency if at unknown location',
@FirstObservation);
END
ELSE IF (@LastDate>@StartDate) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate,'Temporal integrity violated', @FirstObservation);
END
ELSE BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@InitiatingEventType,@StartDate,'Invalid action', @FirstObservation);
END
END;
IF (@CurrentState='DTH') BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'No transitions after terminating state', @FirstObservation);
END;
SET @LastDate=@StartDate;
SET @CurrentState=@NextState;
SET @Transition=@Transition+1;
-- Do end event transition
IF (@CurrentState='NUS') BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType
,@EndDate,'Cannot be not under surveillance before residency end',
@LastObservation);
END;
IF (@CurrentState='SLK') BEGIN
IF (@TerminatingEventType='INT' AND @LastDate<@EndDate) BEGIN
SET @NextState='SLU';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@TerminatingEventType,@EndDate, @LastObservation);
END
ELSE IF (@TerminatingEventType='OTM' AND @LastDate<@EndDate) BEGIN
SET @NextState='NUS';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@TerminatingEventType,@EndDate, @LastObservation);

END
ELSE IF (@TerminatingEventType='VIS' AND @LastDate<=@EndDate) BEGIN
SET @NextState='SLK';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@TerminatingEventType,@EndDate, @LastObservation);
END
ELSE IF (@TerminatingEventType='DTH' AND @LastDate<=@EndDate) BEGIN
SET @NextState='DTH';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@TerminatingEventType,@EndDate, @LastObservation);
END
ELSE IF (@TerminatingEventType IN ('INM','DSS','DLV')) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@TerminatingEventType,@EndDate,'Action cannot end a residency', @LastObservation);
END
ELSE IF (@LastDate>=@EndDate) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@TerminatingEventType,@EndDate,'Temporal integrity violated', @LastObservation);
END
ELSE BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,
@TerminatingEventType,@EndDate,'Invalid action', @LastObservation);
END
END;
IF (@CurrentState='SLU') BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType
,@EndDate,'Cannot be at unknown location before residency end', @LastObservation);
END;
IF (@CurrentState='DTH') BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType
,@EndDate,'Cannot be dead before residency end', @LastObservation);
END;
SET @LastDate=@EndDate;
SET @CurrentState=@NextState;
FETCH C INTO @Individual, @DoB,
@InitiatingEventType, @StartDate, @FirstObservation,
@TerminatingEventType, @EndDate, @LastObservation
END;
CLOSE C;
DEALLOCATE C;
RETURN
END
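GO -- batch separator for SSMS/sqlcmd; CREATE FUNCTION must be the first statement in its batch
--
-- Usage sketch (not part of the original listing): summarise the invalid
-- transitions found by dbo.udfStateTransitions, grouped by reason.
-- The '20000101' argument is an ASSUMED surveillance start date, used here
-- only for illustration; substitute the site's actual DSS start date.
--
SELECT
InvalidReason,
COUNT(*) n
FROM dbo.udfStateTransitions('20000101')
WHERE ToState='INV'
GROUP BY InvalidReason
ORDER BY n DESC
GO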

CREATE FUNCTION dbo.udfSeekDuplicates ()
RETURNS @Duplicates TABLE (
IndA int,
IndB int,
Similarity int
)
AS BEGIN
DECLARE @Individual int
DECLARE @LastName varchar(50)
DECLARE @FirstName varchar(50)
DECLARE @Sex char(3)
DECLARE @DoB datetime

DECLARE C CURSOR LOCAL FAST_FORWARD FOR


SELECT
Individual,LastName,FirstName,Sex,DoB
FROM dbo.Individuals
ORDER BY Individual

OPEN C;

FETCH C INTO @Individual,@LastName,@FirstName,@Sex,@DoB;

WHILE (@@FETCH_STATUS=0) BEGIN


INSERT INTO @Duplicates
SELECT
@Individual,Individual,
dbo.fnacLevenshtein(@LastName,LastName)+
dbo.fnacLevenshtein(@FirstName,FirstName)+
CASE
WHEN @Sex=Sex THEN 0
ELSE 1
END +
ABS(YEAR(@DoB)-YEAR(DoB)) +
ABS(MONTH(@DoB)-MONTH(DoB)) +
ABS(DAY(@DoB)-DAY(DoB))
FROM dbo.Individuals
WHERE @Individual<Individual -- Do not re-evaluate inverse
AND ABS(DATEDIFF(day,@DoB,DoB))<366
AND dbo.fnacLevenshtein(@LastName,LastName)<10
AND dbo.fnacLevenshtein(@FirstName,FirstName)<5
FETCH C INTO @Individual,@LastName,@FirstName,@Sex,@DoB;
END;
CLOSE C;
DEALLOCATE C;

RETURN
END
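GO -- batch separator for SSMS/sqlcmd; CREATE VIEW must be the first statement in its batch
--
-- Usage sketch (not part of the original listing): list the closest candidate
-- duplicate pairs for manual review. Similarity is a sum of edit distances and
-- date differences, so LOWER values mean more alike. The TOP 50 cut-off is an
-- arbitrary choice for illustration, and the query assumes that
-- dbo.fnacLevenshtein (referenced above) is already installed.
--
SELECT TOP 50
D.Similarity,
A.Individual IndA, A.LastName LastNameA, A.FirstName FirstNameA, A.Sex SexA, A.DoB DoBA,
B.Individual IndB, B.LastName LastNameB, B.FirstName FirstNameB, B.Sex SexB, B.DoB DoBB
FROM dbo.udfSeekDuplicates() D
JOIN dbo.Individuals A ON (D.IndA=A.Individual)
JOIN dbo.Individuals B ON (D.IndB=B.Individual)
ORDER BY D.Similarity
GO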

CREATE VIEW dbo.vLocationVisitGaps


AS
WITH NumberedVisits AS (
SELECT
Location, CensusRound, ObservationDate,
ROW_NUMBER()
OVER(PARTITION BY Location, CensusRound
ORDER BY ObservationDate) RowNum,
COUNT(*) OVER(PARTITION BY Location, CensusRound) AS Cnt
FROM dbo.Observations
WHERE CensusRound BETWEEN 1 AND 21
AND ObservationDate BETWEEN '20000101' AND '20091231'
),
MidVisits AS (
SELECT
Location, CensusRound,
CAST(ObservationDate AS float) fDate,
RowNum, Cnt
FROM NumberedVisits

WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
SELECT
Location, CensusRound,
AVG(fDate) mDate
FROM MidVisits
GROUP BY Location, CensusRound
),
MedianVisits AS (
SELECT
Location, CensusRound, CONVERT(datetime,mDate) MedianDate
FROM MedianVisitDate
)
SELECT
R1.Location,
R1.CensusRound Rn,
R2.CensusRound Rnn,
DATEDIFF(day,R1.MedianDate,R2.MedianDate) Granularity
FROM MedianVisits R1
JOIN MedianVisits R2
ON (R1.Location=R2.Location) AND (R1.CensusRound=R2.CensusRound-1)
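GO -- batch separator for SSMS/sqlcmd; CREATE VIEW must be the only statement in its batch
--
-- Usage sketch (not part of the original listing): distribution of the number
-- of days between the median visit dates of consecutive census rounds, as
-- reported by dbo.vLocationVisitGaps. Unusually long or short gaps stand out
-- at either end of the result.
--
SELECT
Granularity,
COUNT(*) n
FROM dbo.vLocationVisitGaps
GROUP BY Granularity
ORDER BY Granularity
GO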

