1 Outcomes
1. Minimum set of INDEPTH Data Quality Metrics defined
2. Site data quality baselines established
3. Common outline and criteria for site data quality projects agreed to
4. Recommendation made for an INDEPTH Data Quality Assurance Program.
1 Program
Time | Topic | Presenter
Day 1 : 11 May 2010
8:00-9:00 | Registration | INDEPTH Secretariat
9:00-9:30 | Welcome and Introduction to Workshop Objectives | INDEPTH Executive, Course Facilitator
9:30-10:30 | What is Data Quality? | Course Facilitator
10:30-11:00 | Tea Break |
11:00-11:30 | Impact of Data Quality on Demographic Measures | Ayaga Bawah
11:30-12:00 | Extent and Implications of Poor Quality Data - iShare Experience | iShare representative
12:00-12:30 | Causes of Poor Quality Data | Course Facilitator
12:30-13:30 | Lunch Break |
13:30-14:30 | Measuring Data Quality : Theory | Course Facilitator
14:30-15:30 | Measuring Data Quality : iShare Experience | iShare representative
15:30-16:00 | Tea Break |
16:00-17:00 | Measuring Data Quality : Practical - Attribute domain constraints | Course Facilitator
Day 2 : 12 May 2010
8:30-9:30 | Measuring Data Quality : Practical - Relational integrity constraints | Course Facilitator
9:30-10:30 | Measuring Data Quality : Practical - Historical Data & State-Dependent Objects | Course Facilitator
10:30-11:00 | Tea Break |
11:00-11:30 | Measuring Data Quality : Practical - General Attribute Dependencies | Course Facilitator
11:30-13:00 | Discussion : Agreeing on a minimum set of data quality metrics for INDEPTH | All Participants
13:00-14:00 | Lunch Break |
14:00-17:00 | Applying agreed set of data quality metrics to own database | All Participants
Day 3 : 13 May 2010
8:30-10:00 | Comparison & Standardisation of Minimum Data Quality Metrics | Course Facilitator
10:00-10:30 | Tea Break |
10:30-11:00 | Total Data Quality Management : Theory | Course Facilitator
11:00-12:30 | Discussion : Data Quality Assurance in INDEPTH : The Way Forward | All Participants
12:30-13:00 | Publication : Workshop Proceedings | Course Facilitator
13:00-14:00 | Lunch |
14:00-16:00 | INDEPTH Minimum Dataset | INDEPTH Secretariat
2 Curriculum
2.1 What is Data Quality?
2.1.1 Learning Objectives
1. Explain the different roles that can be identified in the information
production system
2. Understand the concept of an information product, and relate that to the
HDSS research context
3. Understand and explain the different concepts of data quality
4. Identify the dimensions of data quality most relevant to HDSS
1.1.1 Content
1. Information System Roles
2. Information Products
3. Concepts & Dimensions of Data Quality
1.1 Impact of Data Quality on Demographic Measures
1.1.1 Learning Objectives
To be provided
1.1.2 Content
To be provided
1.2.2 Content
To be provided
1.3 Causes of Poor Quality Data
1.3.1 Learning Objectives
1. Be able to classify and describe the causes of poor data quality
1.1.1 Content
1. Research Design
a. Research Question
b. Research Methodology
c. Data System Design
2. Population Factors
a. Education
b. Cultural
3. Data Collection
a. Field workers
b. Data collection instruments
c. Data Entry
4. Data Analysis
a. Data Conversion
b. Data Extraction
c. Data Cleaning
1.1 Measuring Data Quality
1.1.1 Learning Objectives
1. Classify, list and explain the different rules that can be applied to measure
data quality
1.1.1 Content
1. Data Quality Rules
a. Attribute domain constraints
b. Relational integrity constraints
c. Rules for historical data
d. Rules for state-dependent objects
e. General attribute dependency rules
1.1.1 Content
The examples are all based on a sample database built on the INDEPTH
Reference Data Model (see Appendix A). The SQL used to derive the data
quality indicators is contained in Appendix B. The SQL dialect is SQL Server
2008 T-SQL.
a. Optionality Constraints
These constraints prevent attributes from taking Null, or missing, values. Default
values are often entered to circumvent the Not-Null constraints, i.e., the attribute is
populated with a default value when the actual value is not available.
Cause        n
Unassigned   520
Null         745
Assigned     8225
Total        9490
Indicator %  86.7

$\text{Indicator} = 1 - \frac{\text{Unassigned} + \text{Null}}{\text{Total}}$
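The optionality indicator above can be computed in a single pass. The following is a minimal sketch in SQL Server 2008 T-SQL, assuming a table variable `@Deaths` with a handful of made-up rows standing in for `dbo.Deaths`, and assuming (as in Appendix B) that any `DeathCause` value sorting before 'A' is an unassigned default.

```sql
-- Hypothetical sample rows standing in for dbo.Deaths.
-- The empty string plays the role of a default value that
-- circumvents a Not-Null constraint.
DECLARE @Deaths TABLE (DeathCause varchar(10) NULL);
INSERT INTO @Deaths (DeathCause)
VALUES ('A09'), ('B20'), (''), (NULL), ('I64');

SELECT
  COUNT(*) Total,
  SUM(CASE WHEN DeathCause IS NULL THEN 1 ELSE 0 END) NullCount,
  SUM(CASE WHEN DeathCause < 'A' THEN 1 ELSE 0 END) UnassignedCount,
  -- Indicator = 1 - (Unassigned + Null) / Total, expressed as a percentage
  100.0 * (1 - SUM(CASE WHEN DeathCause IS NULL OR DeathCause < 'A'
                        THEN 1 ELSE 0 END) / CAST(COUNT(*) AS float)) IndicatorPct
FROM @Deaths;
```

With these five sample rows, two fail the rule (one Null, one unassigned), so the indicator is 60%. The cast to float avoids integer division, which would otherwise truncate the ratio to zero.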
b. Format Constraints
These constraints define the expected form in which the attribute values are stored in
the database field. Format constraints are most important when dealing with “legacy”
databases. However, even modern databases are full of surprises. From time to time,
numeric and date/time attributes are still stored in text fields.
Use wildcard characters or regular expressions to detect format violations. The function
to use depends on the database product. In SQL Server 2008 T-SQL, the PATINDEX
function can be used to find any LastName containing a character outside the set of
upper and lower case letters, the space, and the single quote (') character.
SELECT
COUNT(*)
FROM dbo.Individuals
WHERE PATINDEX('%[^a-zA-Z '']%',LastName)>0
LastName     n
Valid        126275
Invalid      137
Total        126412
Indicator %  99.9

$\text{Indicator} = 1 - \frac{\text{Invalid}}{\text{Total}}$
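The remark about dates stored in text fields can be illustrated the same way. This is a sketch only, assuming a hypothetical legacy column `VisitDateText` that is supposed to hold dates as `yyyymmdd` text; the table variable and its rows are invented for illustration.

```sql
-- Hypothetical legacy table with a date stored as text
DECLARE @Legacy TABLE (VisitDateText varchar(20) NULL);
INSERT INTO @Legacy (VisitDateText)
VALUES ('20090131'), ('2009-02-28'), ('20090230'), ('unknown'), (NULL);

SELECT FormatCheck, COUNT(*) n
FROM (
  SELECT
    CASE
      WHEN VisitDateText IS NULL THEN 'Null'
      -- eight digits in the expected yyyymmdd layout
      WHEN VisitDateText NOT LIKE '[12][0-9][0-9][0-9][01][0-9][0-3][0-9]'
        THEN 'Bad format'
      -- right layout, but not a real calendar date (e.g. 30 February)
      WHEN ISDATE(VisitDateText) = 0 THEN 'Not a date'
      ELSE 'Valid'
    END FormatCheck
  FROM @Legacy
) X
GROUP BY FormatCheck
ORDER BY FormatCheck;
```

The pattern check runs first so that values in another layout are reported as format violations rather than being silently interpreted by `ISDATE`, whose reading of ambiguous layouts depends on session language settings.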
c. Valid Value Constraints
These constraints limit the permitted attribute values to a prescribed list or range.
Unfortunately, valid value lists are often unavailable, incomplete, or incorrect. To
identify valid values, we first need to collect counts of all actual values. These counts
can then be analyzed, and actual values can be cross-referenced against the valid
value list, if available. Values that are found in many records are probably valid, even if
they are missing from the data dictionary. Such circumstances typically arise when
new values are added after the original database design, but are not added to the
documentation. Values that have low frequency are suspect.
Start Type   n
Valid        168544
Invalid      2109
Total        170653
Indicator %  98.8

$\text{Indicator} = 1 - \frac{\text{Invalid}}{\text{Total}}$
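The profiling step described above (count every actual value, then cross-reference against the valid value list) might look as follows. This is a sketch under stated assumptions: the rows in `@ResidentEpisodes` are invented, and `@ValidTypes` is a hypothetical, deliberately incomplete data dictionary.

```sql
-- Hypothetical sample of start types
DECLARE @ResidentEpisodes TABLE (InitiatingEventType varchar(3) NULL);
INSERT INTO @ResidentEpisodes (InitiatingEventType)
VALUES ('DLV'),('DLV'),('DLV'),('INM'),('INM'),('INM'),
       ('DSS'),('DSS'),('DSS'),('INT'),('INT'),('HMS');

-- Hypothetical valid value list; 'DSS' is missing on purpose
DECLARE @ValidTypes TABLE (Code varchar(3) PRIMARY KEY);
INSERT INTO @ValidTypes (Code) VALUES ('DLV'),('INM'),('INT');

SELECT
  E.InitiatingEventType,
  COUNT(*) n,
  100.0 * COUNT(*) / SUM(COUNT(*)) OVER () SharePct,   -- share of all records
  CASE WHEN MAX(V.Code) IS NULL THEN 'Not listed' ELSE 'Listed' END Listed
FROM @ResidentEpisodes E
LEFT JOIN @ValidTypes V ON (E.InitiatingEventType = V.Code)
GROUP BY E.InitiatingEventType
ORDER BY n DESC;
```

In this sample, 'DSS' is frequent but not listed, which points to an out-of-date dictionary, while 'HMS' is both rare and not listed, which makes it suspect.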
d. Precision and Granularity Constraints
These constraints require all values of an attribute to have the same precision,
granularity, and unit of measurement. Precision constraints can apply to both numeric
and date/time attributes. For numeric values, they define the desired number of
decimals. For date/time attributes, precision can be defined as calendar month, day,
hour, minute, or second. Data profiling can be used to calculate the distribution of values
for each precision level.
Indicator Value
External In-Migration Date 77.7%
External Out-Migration Date 77.4%
Internal In-Migration Date 79.0%
Internal Out-Migration Date 78.1%
Similarity   n
0            42
1            238
2            442
3            820
4            1699
5            3832
6            8349
7            16849
8            31836
9            59679
10           110038

$\text{Indicator} = 1 - \frac{\text{Individuals} - \text{UniqueIndividuals}}{\text{Individuals}} = 99.4\%$

Footnote 1: The Levenshtein distance between two strings is defined as the minimum number of edits
needed to transform one string into the other, with the allowable edit operations being
insertion, deletion, or substitution of a single character.
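The Levenshtein distance defined in the footnote can be computed with the standard dynamic-programming recurrence. The function below is a minimal, unoptimised sketch in SQL Server 2008 T-SQL; `dbo.udfLevenshteinSketch` is a hypothetical name and is not claimed to be what `dbo.udfSeekDuplicates` in Appendix B does internally.

```sql
-- Sketch only: hypothetical helper, not part of Appendix B.
CREATE FUNCTION dbo.udfLevenshteinSketch (@s nvarchar(50), @t nvarchar(50))
RETURNS int
AS
BEGIN
    DECLARE @n int, @m int, @i int, @j int, @cost int, @result int;
    -- DATALENGTH/2 counts nvarchar characters, including trailing spaces
    SET @n = DATALENGTH(@s) / 2;
    SET @m = DATALENGTH(@t) / 2;

    -- v at (i, j) = distance between the first i characters of @s
    -- and the first j characters of @t
    DECLARE @d TABLE (i int NOT NULL, j int NOT NULL, v int NOT NULL,
                      PRIMARY KEY (i, j));

    SET @i = 0;
    WHILE @i <= @n
    BEGIN
        INSERT INTO @d (i, j, v) VALUES (@i, 0, @i);   -- i deletions
        SET @i = @i + 1;
    END;
    SET @j = 1;
    WHILE @j <= @m
    BEGIN
        INSERT INTO @d (i, j, v) VALUES (0, @j, @j);   -- j insertions
        SET @j = @j + 1;
    END;

    SET @i = 1;
    WHILE @i <= @n
    BEGIN
        SET @j = 1;
        WHILE @j <= @m
        BEGIN
            -- character comparison follows the database collation
            SET @cost = CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1)
                             THEN 0 ELSE 1 END;
            INSERT INTO @d (i, j, v)
            SELECT @i, @j, MIN(x.c)
            FROM (
                SELECT v + 1 AS c FROM @d WHERE i = @i - 1 AND j = @j      -- deletion
                UNION ALL
                SELECT v + 1 FROM @d WHERE i = @i AND j = @j - 1           -- insertion
                UNION ALL
                SELECT v + @cost FROM @d WHERE i = @i - 1 AND j = @j - 1   -- substitution or match
            ) x;
            SET @j = @j + 1;
        END;
        SET @i = @i + 1;
    END;

    SELECT @result = v FROM @d WHERE i = @n AND j = @m;
    RETURN @result;   -- NULL when either input is NULL
END;
GO
-- One deletion ('m') turns the first name into the second, so this returns 1
SELECT dbo.udfLevenshteinSketch(N'Simbongiwe', N'Sibongiwe') Distance;
```

A table variable keeps the recurrence readable at the cost of speed. For pairwise comparison of a whole population register, a CLR function or a blocking step that limits the candidate pairs would be needed.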
Similarity = 1
IndA | IndB | Name A | Name B | Sex A | Sex B | DoB A | DoB B
461 | 43169 | Nesbit, Nqobile | Nesbit, Nqobile | FEM | FEM | 1992/09/08 | 1993/09/08
1001 | 1005 | Nguyen, Simbongiwe | Nguyen, Sibongiwe | FEM | FEM | 1995/03/25 | 1995/03/25
Similarity = 2
IndA | IndB | Name A | Name B | Sex A | Sex B | DoB A | DoB B
1388 | 18938 | Mitchell, Hlengiwe | Mitchell, Hlengiwe | FEM | FEM | 1983/05/16 | 1983/04/15
3378 | 3380 | Myers, Sandile | Myers, Zandile | MAL | FEM | 1987/11/25 | 1987/11/25
Similarity = 3
IndA | IndB | Name A | Name B | Sex A | Sex B | DoB A | DoB B
84 | 85 | Johnson, Ntando | Johnson, Nontando | MAL | FEM | 1983/12/03 | 1983/12/03
255 | 260 | Sosibo, Thandiwe | Sosibo, Thandeka | FEM | FEM | 1976/05/07 | 1976/05/07
569 | 12191 | Smith, Bongani | Smith, Lindani | MAL | MAL | 1994/08/14 | 1994/08/14
585 | 35418 | García, Sanele | García, Zanele | MAL | FEM | 1997/12/06 | 1996/12/06
b. Reference rules
A reference rule ensures that every reference made from one entity occurrence to
another entity occurrence can be successfully resolved. Each reference rule is
represented in relational data models by a foreign key that ties an attribute or a
collection of attributes of one entity with the primary key of another entity. Foreign
keys guarantee that navigation of a reference across entities does not result in a “dead
end.”
c. Cardinal rules
A cardinal rule defines the constraints on relationship cardinality. Cardinal rules are not
to be confused with reference rules. Whereas reference rules are concerned with the
identity of the occurrences in referenced entities, cardinal rules define the allowed
number of such occurrences.
d. Inheritance rules
An inheritance rule expresses integrity constraints on entities that are associated
through generalization and specialization, or more technically through subtyping.
a. Currency Rule
A currency rule enforces the desired “freshness” of the historical data. Currency rules
are usually expressed in the form of constraints on the effective date of the most
recent record in the history. For example, if the status of an individual under
surveillance is 'Current', then the last visit date should be no earlier than the start of
the previous surveillance round.
Currency     Residency Episodes
Current      62621
Not Current  2384
Total        65005
Indicator    96.3%

$\text{Indicator} = 1 - \frac{\text{NotCurrent}}{\text{Total}}$
Example 2 : At year end, the last status observation should not be earlier than 1 July of that
year (i.e. not older than 183 days).
b. Retention Rule
A retention rule enforces the desired depth of the historical data. Retention rules are
usually expressed in the form of constraints on the overall duration or the number of
records in the history.
c. Granularity rule
E.g. If the surveillance implies a six monthly visit to each homestead, is that in fact the
case?
d. Continuity rule
A continuity rule prohibits gaps and overlaps in accumulator histories. Continuity rules
require that the beginning date of each measurement period immediately follows the
end date of the previous period.
For example for internal migrations, the next residency episode must follow directly on
the previous.
Continuity     n
Continuity     18 657
Discontinuity  6 430
Total          25 087
Indicator      74.4%

$\text{Indicator} = 1 - \frac{\text{Discontinuity}}{\text{Total}}$
A timestamp pattern rule requires all timestamps to fall into a certain prescribed date
interval, such as every March or every other Wednesday or between the first and fifth
of each month. Occasionally the pattern takes the form of minimum or maximum
length of time between measurements. For example, participants in a medical study
may be required to take blood pressure readings at least once a week. While the
length of time between particular measurements will differ, it has to be no longer than
seven days.
Example : Similar to granularity rule, homestead has to be visited at least once every
six months.
Semester Visits  Locations
Visited          194,238
Not Visited      19,488
Total            213,726
Indicator        90.9%

$\text{Indicator} = 1 - \frac{\text{NotVisited}}{\text{Total}}$
Note : Care should be taken with the type of observations used to derive this measure.
If, for example, only observations tied to residency and status observations are
considered, locations that were visited but where no observation was recorded due to
non-contact with the occupants will not be counted in this indicator.
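A maximum-gap form of the timestamp pattern rule can be checked by pairing each visit with the next one at the same location. The sketch below assumes a table variable `@Visits` with invented rows and uses the 183-day limit quoted in this document. SQL Server 2008 has no LAG function, so consecutive rows are paired with ROW_NUMBER and a self-join, the same technique Appendix B uses.

```sql
-- Hypothetical visit history per location
DECLARE @Visits TABLE (Location int NOT NULL, ObservationDate datetime NOT NULL);
INSERT INTO @Visits (Location, ObservationDate)
VALUES (1, '20090110'), (1, '20090705'), (1, '20100108'),
       (2, '20090115'), (2, '20091201');

WITH NumberedVisits AS (
  SELECT Location, ObservationDate,
         ROW_NUMBER() OVER (PARTITION BY Location
                            ORDER BY ObservationDate) RowNum
  FROM @Visits
)
SELECT
  V1.Location,
  V1.ObservationDate PreviousVisit,
  V2.ObservationDate NextVisit,
  DATEDIFF(day, V1.ObservationDate, V2.ObservationDate) GapDays,
  CASE WHEN DATEDIFF(day, V1.ObservationDate, V2.ObservationDate) > 183
       THEN 'Too long' ELSE 'OK' END GapStatus
FROM NumberedVisits V1
JOIN NumberedVisits V2
  ON (V1.Location = V2.Location) AND (V1.RowNum = V2.RowNum - 1)
ORDER BY V1.Location, V1.ObservationDate;
```

Only gaps between recorded visits are evaluated here. A location that was never revisited produces no row, so a complete indicator would also need to compare the last visit against the end of the observation period.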
Value histories for time-dependent attributes usually also follow systematic patterns. A
value pattern rule utilizes these patterns to predict reasonable ranges of values for
each measurement and identify likely outliers. Value pattern rules can restrict
direction, magnitude, or volatility of change in data values.
i. Direction of Change
The simplest value pattern rules restrict the direction of value changes from
measurement to measurement. A person's height is unlikely to decrease over
multiple measurements in time; the same holds for educational attainment.
Direction  Measures
Invalid    21,178
Valid      128,613
Total      149,791
Indicator  85.9%

$\text{Indicator} = 1 - \frac{\text{Invalid}}{\text{Total}}$
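A direction-of-change check compares each measurement with the previous one for the same individual. The following is a sketch only: it assumes education level is stored as an ordinal integer (higher means more education), and the table variable and its rows are invented.

```sql
-- Hypothetical status observations; EducationLevel assumed to be ordinal
DECLARE @StatusObservations TABLE (
  Individual int NOT NULL,
  ObservationDate datetime NOT NULL,
  EducationLevel int NOT NULL);
INSERT INTO @StatusObservations (Individual, ObservationDate, EducationLevel)
VALUES (1, '20050301', 3), (1, '20060301', 4), (1, '20070301', 2),
       (2, '20050301', 5), (2, '20060301', 5);

WITH Numbered AS (
  SELECT Individual, ObservationDate, EducationLevel,
         ROW_NUMBER() OVER (PARTITION BY Individual
                            ORDER BY ObservationDate) RowNum
  FROM @StatusObservations
)
SELECT Direction, COUNT(*) Measures
FROM (
  SELECT
    -- a later measurement lower than the earlier one violates the rule
    CASE WHEN S2.EducationLevel < S1.EducationLevel
         THEN 'Invalid' ELSE 'Valid' END Direction
  FROM Numbered S1
  JOIN Numbered S2
    ON (S1.Individual = S2.Individual) AND (S1.RowNum = S2.RowNum - 1)
) X
GROUP BY Direction
ORDER BY Direction;
```

The sketch counts consecutive pairs of measurements, so the first measurement of each individual is never classified. Whether to count pairs or individual measurements in the denominator is a design choice that should be fixed before comparing sites.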
ii. Magnitude of Change
Validity           Measures
Valid              117,626
Invalid Direction  21,181
Invalid Magnitude  10,984
Total              149,791
Indicator          78.5%

$\text{Indicator} = 1 - \frac{\text{InvalidDirection} + \text{InvalidMagnitude}}{\text{Total}}$
Various events often affect the same objects and therefore may be
interdependent. Data quality rules can use these dependencies to validate the
event histories. E.g. an out-migration event cannot be recorded for an
individual without a prior birth or in-migration event.
Dependency  n
Correct     56,817
Incorrect   3
Total       56,820
Indicator   99.99%

$\text{Indicator} = 1 - \frac{\text{Incorrect}}{\text{Total}}$
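The event dependency above (no out-migration without a prior birth or in-migration) can be tested with a correlated EXISTS. This is a sketch under assumptions: `@Events` is a hypothetical flattened event list, and the codes 'DLV' (birth), 'INM' (in-migration) and 'OTM' (out-migration) are borrowed from the event type codes used in Appendix B.

```sql
-- Hypothetical event history per individual
DECLARE @Events TABLE (
  Individual int NOT NULL,
  EventDate datetime NOT NULL,
  EventType varchar(3) NOT NULL);
INSERT INTO @Events (Individual, EventDate, EventType)
VALUES (1, '20010101', 'DLV'), (1, '20050601', 'OTM'),
       (2, '20030215', 'INM'), (2, '20040101', 'OTM'),
       (3, '20020901', 'OTM');   -- no birth or in-migration on record

WITH Checked AS (
  SELECT
    O.Individual,
    CASE WHEN EXISTS (
           SELECT 1
           FROM @Events P
           WHERE P.Individual = O.Individual
             AND P.EventType IN ('DLV', 'INM')
             -- "prior" is read here as on or before the out-migration date
             AND P.EventDate <= O.EventDate)
         THEN 'Correct' ELSE 'Incorrect' END Dependency
  FROM @Events O
  WHERE O.EventType = 'OTM'
)
SELECT Dependency, COUNT(*) n
FROM Checked
GROUP BY Dependency
ORDER BY Dependency;
```

The classification is done in a common table expression because SQL Server does not allow a subquery inside a GROUP BY expression.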
ii. Event Conditions
Events of many kinds do not occur at random but rather only happen under
certain unique circumstances. Event conditions verify these circumstances.
Example: Birth spacing. The time between two subsequent pregnancies with a
live birth outcome should not be less than 9 months (280 days).
Indicator 98.7%
Events themselves are often complex entities, each with numerous attributes.
Birth Spacing  Pregnancies
Valid          70,601
Invalid        1,285
Total          71,886
Indicator      98.2%
These rules place constraints on the lifecycle of objects described by so-called
state-transition models.
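[Figure: state-transition model for individuals under surveillance. Labels recoverable
from the diagram: Visit, Census, Inmigration, Outmigration, Internal Inmigration, an
Outmigration label with a truncated prefix ("...ernal Outmigration"), and Under
surveillance (unknown location).]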
A state domain constraint limits the set of allowed states to only those shown
in the state-transition model. Invalid states are usually typos inside otherwise
valid records. The true state can often be deduced based on the action value.
An action domain constraint limits the set of allowed actions to only those shown in the
state-transition model. Invalid actions are usually typos inside otherwise valid records.
The true action can often be deduced based on the state value.
A terminator domain constraint limits the set of allowed terminators, specifically states
in which an object can start and end its life cycle. Invalid terminators often are a
symptom of missing records at the beginning of the life cycle.
To State   Action  n
INV        HMS     1,838
INV        INM     51
INV        INT     9,205
SLK        DLV     16,430
SLK        DSS     62,633
SLK        INM     34,500
Total              124,657
Indicator          91.1%
d. State-transition constraints
These constraints limit state changes to those allowed by the state-transition model.
For example, a person who is already out-migrated cannot be out-migrated again
without being in-migrated in between. Invalid state-transitions often signify a missing
action.
Final State  Individuals
Invalid      16,409
Valid        108,248
Total        124,657
Indicator    86.8%

$\text{Indicator} = 1 - \frac{\text{InvalidEndState}}{\text{Individuals}}$

State        Transitions
Invalid      16,409
Valid        296,501
Total        312,910
Indicator    94.8%

$\text{Indicator} = 1 - \frac{\text{InvalidTransition}}{\text{Transitions}}$
Invalid Transition Causes
e. State-action constraints
Require that each action is consistent with the change in the object state. For example,
after an out-migration, the state of an individual must be non-resident.
f. Continuity rules
Prohibit gaps and overlaps in state-transition history. In other words, they require that
the effective date of each state record must immediately follow the end date of the
previous state record.
g. Duration rules
Put a constraint on the maximum and/or minimum length of time an object can stay in
any specific state. The simplest form of the duration rule is the zero-length rule, which
requires the length of time spent in each state to be greater than zero.
Example : Residency episode duration cannot be negative (end before start) or zero.
Total 170,653
Indicator 99.8%
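The zero-length rule from the example above reduces to a comparison of the two episode dates. A minimal sketch, assuming a table variable `@ResidentEpisodes` with invented rows standing in for `dbo.ResidentEpisodes`:

```sql
-- Hypothetical residency episodes
DECLARE @ResidentEpisodes TABLE (
  ResidentEpisode int PRIMARY KEY,
  StartDate datetime NOT NULL,
  EndDate datetime NOT NULL);
INSERT INTO @ResidentEpisodes (ResidentEpisode, StartDate, EndDate)
VALUES (1, '20000101', '20031231'),   -- valid
       (2, '20040101', '20040101'),   -- zero length
       (3, '20050610', '20050601');   -- ends before it starts

SELECT Duration, COUNT(*) n
FROM (
  SELECT
    CASE
      WHEN EndDate < StartDate THEN 'Negative'
      WHEN EndDate = StartDate THEN 'Zero'
      ELSE 'Valid'
    END Duration
  FROM @ResidentEpisodes
) X
GROUP BY Duration
ORDER BY Duration;
```

Reporting negative and zero durations separately is a choice made for this sketch. They usually have different causes (swapped dates versus an event recorded twice on the same day), so keeping them apart can help when tracing errors back to their source.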
h. Action pre-conditions
The conditions that must be satisfied before an action can take place. E.g. the mother must
be resident for a child to start residency with birth.
Example : Mother’s state at child residency start, if child starts residency with
delivery.
Mothers              Children
Mother resident      14,696
Mother non-resident  1,619
Mother unknown       131
Total                16,446
Indicator            89.4%

$\text{Indicator} = 1 - \frac{\text{MotherNonResident} + \text{MotherUnknown}}{\text{Children}}$
i. Action post-conditions
These are the conditions that must be satisfied after the action is successfully
completed.
a. Redundant attributes
Redundant attributes are data elements that represent the same attribute of a real
world object. While attribute redundancy goes against basic data modelling principles,
it is common in practice for several reasons. First, redundancy is widespread in
“legacy” databases and certain systems that were converted from the “legacy”
databases. Secondly, redundancy is often used even in modern relational databases to
improve efficiency of data access, information presentation, and transaction
processing. Finally, some data across different systems are invariably redundant.
Comparison of redundant attributes is a sure way to identify (and eventually correct)
numerous data problems.
Example: The link between mother and child is explicit via MotherID and implicit via births
and pregnancies. The two should be consistent.
Link        Pairs
Linked      15746
Not linked  1
Total       15747
Indicator   99.99%

$\text{Indicator} = 1 - \frac{\text{NotLinked}}{\text{Pairs}}$

Of the cases where residency start is Birth and it is linked to a Pregnancy, in one case
this link between child and mother was not reflected in the MotherID of the child.
The converse is slightly more complex. Of the children born to the mother while she
was resident, are all such children recorded as resident and the residency start marked
as Birth? Whether this test is absolute will depend on the eligibility rules of the HDSS.
Link | Pairs | Description
Birth not linked | 1750 | Child resident by birth is not linked to a resident mother via a pregnancy.
Birth not resident | 1102 | Resident mother gave birth to a child that is not resident from birth.
Consistent | 14696 |
Total | 17548 |
Indicator | 83.7% |

$\text{Indicator} = 1 - \frac{\text{BirthsNotLinked} + \text{BirthsNotResident}}{\text{Pairs}}$
b. Derived Attributes
Values of derived attributes are calculated based on the values of some other
attributes. This approach is very common when the calculation is rather complex and
involves data stored in multiple records of possibly multiple entities. Performing the
calculation on the fly is then very inefficient. One of the most common special cases of
derived attribute constraints is a balancing rule, which requires an aggregate attribute
to equal the total of atomic level attribute values.
Example : Data should satisfy the demographic equation:
$\text{Population}_{t+1} = \text{Population}_t + \text{Births}_t - \text{Deaths}_t + (\text{Immigration}_t - \text{Emigration}_t)$
Component Y2000 Y2001 Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
Population 0 66035 68027 67277 65785 64981 64916 65376 64289 63494
Start observation 62633 0 0 0 0 0 0 0 0 0
Births 1675 1723 1719 1641 1743 1748 1749 1692 1579 1101
Immigration 3887 5689 7033 6032 5111 5348 5511 5242 5279 4992
Deaths 886 1077 1098 1129 983 979 886 913 796 697
Emigration 1923 4857 8194 7229 6204 6038 5760 6171 5702 5427
Population t+1 65386 67513 67487 66592 65452 65060 65530 65226 64649 63463
Balance -649 -514 210 807 471 144 154 937 1155
Indicator 99.0% 99.2% 99.7% 98.8% 99.3% 99.8% 99.8% 98.6% 98.2%
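The balancing rule in the first table can be recomputed from its components. The sketch below copies the Y2000 to Y2003 columns of that table into a table variable. Two points are assumptions: the Start observation row is added to the right-hand side of the demographic equation, and the indicator uses the following year's observed population as its denominator (the source does not say which denominator was used).

```sql
-- Component values copied from the Y2000-Y2003 columns of the table above
DECLARE @Components TABLE (
  Yr int PRIMARY KEY, Population int, StartObservation int,
  Births int, Immigration int, Deaths int, Emigration int);
INSERT INTO @Components
  (Yr, Population, StartObservation, Births, Immigration, Deaths, Emigration)
VALUES (2000,     0, 62633, 1675, 3887,  886, 1923),
       (2001, 66035,     0, 1723, 5689, 1077, 4857),
       (2002, 68027,     0, 1719, 7033, 1098, 8194),
       (2003, 67277,     0, 1641, 6032, 1129, 7229);

WITH Expected AS (
  SELECT Yr,
         Population + StartObservation + Births + Immigration
           - Deaths - Emigration PopulationNext
  FROM @Components
)
SELECT
  E.Yr,
  E.PopulationNext,                          -- derived from the components
  N.Population ObservedNext,                 -- recorded start population of year t+1
  E.PopulationNext - N.Population Balance,
  100.0 * (1 - ABS(E.PopulationNext - N.Population)
               / CAST(N.Population AS float)) IndicatorPct
FROM Expected E
JOIN @Components N ON (N.Yr = E.Yr + 1)
ORDER BY E.Yr;
```

For 2000, 2001 and 2002 this reproduces the Population t+1 values (65386, 67513, 67487) and the Balance values (-649, -514, 210) shown in the table, with indicator values that round to 99.0%, 99.2% and 99.7%.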
Provision made for contextual factors such as change in HDSS boundary and loss to
follow-up:
Component Y2000 Y2001 Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009
Population 0 66033 68025 67266 65654 64263 63436 65376 64289 63494
Start observation 62633 0 0 0 0 0 2239 0 0 0
Births 1675 1723 1719 1640 1729 1696 1685 1692 1579 1101
Immigration 3885 5689 7027 5974 4940 5069 5172 5242 5279 4992
Deaths 886 1077 1098 1129 983 979 886 913 796 697
Emigration 1923 4857 8194 7229 6204 6038 5760 6171 5702 5427
Loss to Follow-up 56 50 96 78 93 67 204 538 635 36233
Population t+1 65328 67461 67383 66444 65043 63944 65682 64688 64014 27230
Balance -705 -564 117 790 780 508 306 399 520
Indicator 98.9% 99.2% 99.8% 98.8% 98.8% 99.2% 99.5% 99.4% 99.2%
The values of redundant and derived attributes are prescribed exactly by the
dependency. Oftentimes, the relationships between attributes are not so exact. The
value of one attribute may restrict possible values of another attribute to a smaller
subset, but not to a single value.
Example : Certain causes of death are only possible for women and/or men, e.g.
cancer of the cervix or causes related to maternal death.
Causes of death that ought to be associated with women:
Sex          n
FEM          120
MAL          1
Total        121
Indicator %  99.2

$\text{Indicator} = 1 - \frac{\text{MaleDeaths}}{\text{DeathsFemaleCauses}}$
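A check of this kind needs a list of sex-specific causes. The sketch below assumes ICD-10 coding of `DeathCause` and uses a deliberately short, illustrative list: C53 (malignant neoplasm of the cervix) and chapter O (pregnancy, childbirth and the puerperium). The table variable and its rows are invented; a real check would use the full list agreed for the site.

```sql
-- Hypothetical deaths with sex and ICD-10 cause
DECLARE @Deaths TABLE (
  Individual int PRIMARY KEY,
  Sex varchar(3) NOT NULL,
  DeathCause varchar(10) NULL);
INSERT INTO @Deaths (Individual, Sex, DeathCause)
VALUES (1, 'FEM', 'C53'), (2, 'FEM', 'O72'),
       (3, 'MAL', 'C53'),              -- violates the dependency
       (4, 'MAL', 'I64'), (5, 'FEM', 'A09');

-- Illustrative, incomplete list of female-only cause prefixes
SELECT
  Sex,
  COUNT(*) n
FROM @Deaths
WHERE DeathCause LIKE 'C53%'
   OR DeathCause LIKE 'O%'
GROUP BY Sex
ORDER BY Sex;
```

Any row returned for 'MAL' is a violation. Whether the sex or the cause of death is the wrong value cannot be decided from this query alone and needs a look at the source record.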
d. Conditional Optionality
e. Correlated Attributes
Values of one attribute can change the likelihood of values of another one, though not
firmly restricting any possibilities. An example is the correlation between gender and
first name. The majority of names are distinctly male or female. Thus there is a definite
relationship between these attributes; however, the relationship is not exact in nature.
1.1.1 Content
1. Role Players
a. Data Collectors
b. Data Custodians
c. Data Consumers
2. Total Data Quality Management Cycle
a. Define
b. Measure
c. Analyse
d. Improve
Appendix A : Sample Database
[Figure: entity-relationship diagram of the sample database. Entity and attribute labels
recoverable from the diagram: Individual, LastName, FirstName, Sex, DoB, MotherID,
FatherID, Location, Latitude, Longitude, ResidentEpisode, StartDate, EndDate,
TerminatingEventType, FirstObservation, LastObservation, Pregnancy, StillBorn, LiveBorn,
BirthAttendant, BirthLocation, Birthweight, DeathCause, DeathLocation, Observation,
ObservationDate, Observer, ObservationType, CensusRounds, CensusRound,
StatusObservations, StatusObservationID, MaritalStatus, EducationLevel.]
Appendix B : SQL Scripts
--
--region Attribute Domain Constraints
--
--region Optionality Constraints
--
--region Cause of Death example
--
SELECT
DeathCause,
COUNT(*) n
FROM dbo.Deaths D
GROUP BY DeathCause
ORDER BY DeathCause --COUNT(*) DESC
--
SELECT
DeathCause,
MAX(C.Description) Description,
COUNT(*) n
FROM dbo.Deaths D
LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY DeathCause
ORDER BY DeathCause
--
-- Final formulation
--
SELECT
CASE
WHEN DeathCause IS NULL THEN 'Null'
WHEN DeathCause<'A' THEN 'Unassigned'
ELSE 'Assigned'
END Cause,
COUNT(*) n
FROM dbo.Deaths D
LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY CASE
WHEN DeathCause IS NULL THEN 'Null'
WHEN DeathCause<'A' THEN 'Unassigned'
ELSE 'Assigned'
END
--
-- Data Quality Trend
--
SELECT
YEAR(E.EndDate) Year,
CASE
WHEN DeathCause IS NULL THEN 'Null'
WHEN DeathCause<'A' THEN 'Unassigned'
ELSE 'Assigned'
END Cause,
COUNT(*) n
FROM dbo.Deaths D
JOIN dbo.ResidentEpisodes E ON D.ResidentEpisode=E.ResidentEpisode
LEFT JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY YEAR(E.EndDate),
CASE
WHEN DeathCause IS NULL THEN 'Null'
WHEN DeathCause<'A' THEN 'Unassigned'
ELSE 'Assigned'
END
ORDER BY YEAR(E.EndDate),Cause
--endregion
--region Complex example - Internal migration destination
-- Destination location for internal migrations
SELECT
DestinationLocation, COUNT(*) n
FROM dbo.OutMigrations OM
JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisode
WHERE TerminatingEventType='INT'
GROUP BY DestinationLocation
ORDER BY COUNT(*) DESC
--
-- Grouped Destination
--
SELECT
CASE
WHEN DestinationLocation=999998 THEN 'Unknown'
WHEN DestinationLocation IS NULL THEN 'Null'
ELSE 'Known'
END Destination,
COUNT(*) n
FROM dbo.OutMigrations OM
JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisode
WHERE TerminatingEventType='INT'
GROUP BY CASE
WHEN DestinationLocation=999998 THEN 'Unknown'
WHEN DestinationLocation IS NULL THEN 'Null'
ELSE 'Known'
END
--
-- Further Investigation
--
SELECT
CASE
WHEN DestinationLocation=999998 THEN 'Unknown'
WHEN DestinationLocation IS NULL THEN 'Null'
WHEN L.Location IS NULL THEN 'Location wrong'
ELSE 'Known'
END Destination,
COUNT(*) n
FROM dbo.OutMigrations OM
JOIN dbo.ResidentEpisodes RE ON OM.ResidentEpisode=RE.ResidentEpisode
LEFT JOIN dbo.Locations L ON OM.DestinationLocation=L.Location
WHERE TerminatingEventType='INT'
GROUP BY CASE
WHEN DestinationLocation=999998 THEN 'Unknown'
WHEN DestinationLocation IS NULL THEN 'Null'
WHEN L.Location IS NULL THEN 'Location wrong'
ELSE 'Known'
END
--endregion
--endregion
--region Format Constraints
SELECT
COUNT(*) Total,
SUM(CASE WHEN PATINDEX('%[^a-zA-Z '']%',LastName)>0 THEN 1 ELSE 0 END) Invalid
FROM dbo.Individuals
--endregion
--region Valid Value Constraints
SELECT
InitiatingEventType,
COUNT(*) n
FROM dbo.ResidentEpisodes
GROUP BY InitiatingEventType
--
SELECT
YEAR(StartDate) Yr,
CASE
WHEN InitiatingEventType='HMS' THEN 'Invalid'
ELSE 'Valid'
END Validity,
COUNT(*) n
FROM dbo.ResidentEpisodes
GROUP BY YEAR(StartDate),
CASE
WHEN InitiatingEventType='HMS' THEN 'Invalid'
ELSE 'Valid'
END
ORDER BY Yr,Validity
--
-- Birth Weight
--
SELECT
Birthweight/100 W100q,
COUNT(*) n
FROM dbo.Births B
JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
WHERE StartDate BETWEEN '20000101' AND '20101231'
GROUP BY Birthweight/100
ORDER BY Birthweight/100
--endregion
--region Precision and Granularity Constraints
--region Date of Birth
-- Birth Date
SELECT
StartPrecision,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DLV'
GROUP BY StartPrecision
ORDER BY StartPrecision
--endregion
--region Complex example Migration Date Precision
--
-- InMigration
SELECT
StartPrecision,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INM'
GROUP BY StartPrecision
ORDER BY StartPrecision
--
-- Internal InMigration
SELECT
StartPrecision,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INT'
GROUP BY StartPrecision
ORDER BY StartPrecision
--
-- OutMigration
SELECT
EndPrecision,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='OTM'
GROUP BY EndPrecision
ORDER BY EndPrecision
--
-- Internal OutMigration
SELECT
EndPrecision,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='INT'
GROUP BY EndPrecision
ORDER BY EndPrecision
--
-- Migration Precision by Time
--
WITH InPrecision AS (
SELECT
YEAR(StartDate) Yr,
StartPrecision [Precision],
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType IN ('INM','INT')
GROUP BY YEAR(StartDate),StartPrecision
),
OutPrecision AS (
SELECT
YEAR(EndDate) Yr,
EndPrecision [Precision],
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType IN ('INT','OTM')
GROUP BY YEAR(EndDate),EndPrecision
),
InScore AS (
SELECT
Yr,
SUM((10-[Precision])*n) Score,
SUM(9*n) MaxScore
FROM InPrecision
GROUP BY Yr
),
OutScore AS (
SELECT
Yr,
SUM((10-[Precision])*n) Score,
SUM(9*n) MaxScore
FROM OutPrecision
GROUP BY Yr
)
SELECT
I.Yr,
SUM(ISNULL(I.Score,0)+ISNULL(O.Score,0)),
SUM(ISNULL(I.MaxScore,0)+ISNULL(O.MaxScore,0))
FROM InScore I
JOIN OutScore O ON (I.Yr=O.Yr)
GROUP BY I.Yr
ORDER BY I.Yr
--endregion
--endregion
--endregion
--
--region Relational Integrity Constraints
--region Identity Rules
-- Duplicate Individuals
SELECT
*
INTO IndividualComparison
FROM dbo.udfSeekDuplicates()
--
SELECT
Similarity,
COUNT(*) n
FROM dbo.IndividualComparison
GROUP BY Similarity
ORDER BY Similarity
--
SELECT
C.IndA,C.IndB,
I1.FirstName FirstNameA, I2.FirstName FirstNameB,
I1.LastName LastNameA, I2.LastName LastNameB,
I1.Sex SexA, I2.Sex SexB,
I1.DoB DoBA, I2.DoB DoBB
FROM dbo.IndividualComparison C
JOIN dbo.Individuals I1 ON (C.IndA=I1.Individual)
JOIN dbo.Individuals I2 ON (C.IndB=I2.Individual)
WHERE C.Similarity=0
ORDER BY C.IndA,C.IndB
--region AC Specific
SELECT
C.IndA,C.IndB,
I1.Name NameA, I2.Name NameB,
I1.Sex SexA, I2.Sex SexB,
I1.DoB DoBA, I2.DoB DoBB
FROM dbo.IndividualComparison C
JOIN ACDIS.dbo.vacNamedIndividuals I1 ON (C.IndA=I1.IIntID)
JOIN ACDIS.dbo.vacNamedIndividuals I2 ON (C.IndB=I2.IIntID)
WHERE C.Similarity=0
ORDER BY C.IndA,C.IndB
--endregion
SELECT
COUNT(*)
FROM dbo.Individuals
--endregion
--region Reference Rules
--
-- Child to Parent linkages
-- MotherId on Child
SELECT
CASE
WHEN C.MotherID IS NULL THEN 'Unknown'
WHEN M.Individual IS NULL THEN 'Missing'
ELSE 'Known'
END Mother,
COUNT(*) n
FROM dbo.Individuals C
LEFT JOIN dbo.Individuals M ON (C.MotherID=M.Individual)
GROUP BY CASE
WHEN C.MotherID IS NULL THEN 'Unknown'
WHEN M.Individual IS NULL THEN 'Missing'
ELSE 'Known'
END
-- FatherId on Child
SELECT
CASE
WHEN C.FatherID IS NULL THEN 'Unknown'
WHEN F.Individual IS NULL THEN 'Missing'
ELSE 'Known'
END Father,
COUNT(*) n
FROM dbo.Individuals C
LEFT JOIN dbo.Individuals F ON (C.FatherID=F.Individual)
GROUP BY CASE
WHEN C.FatherID IS NULL THEN 'Unknown'
WHEN F.Individual IS NULL THEN 'Missing'
ELSE 'Known'
END
--endregion
--region Cardinal Rules
-- Incorrect formulation
SELECT
CASE
WHEN R.Individual IS NULL THEN 'None'
ELSE 'Exists'
END Residency,
COUNT(*) n
FROM dbo.Individuals I
LEFT JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
GROUP BY CASE
WHEN R.Individual IS NULL THEN 'None'
ELSE 'Exists'
END
--region Correct formulation
WITH UniqueResidencies AS (
SELECT DISTINCT
Individual
FROM dbo.ResidentEpisodes
)
SELECT
CASE
WHEN R.Individual IS NULL THEN 'None'
ELSE 'Exists'
END Residency,
COUNT(*) n
FROM dbo.Individuals I
LEFT JOIN UniqueResidencies R ON (I.Individual=R.Individual)
GROUP BY CASE
WHEN R.Individual IS NULL THEN 'None'
ELSE 'Exists'
END
--
-- Residency Cardinality
--
WITH ResidencyCount AS (
SELECT
I.Individual,
COUNT(*) n
FROM dbo.Individuals I
JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
GROUP BY I.Individual
UNION
SELECT
I.Individual,
0 n
FROM dbo.Individuals I
LEFT JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
WHERE R.Individual IS NULL
)
SELECT
n ResidencyCardinality,
COUNT(*) Cnt
FROM ResidencyCount
GROUP BY n
ORDER BY n
--endregion
--endregion
--endregion
--
--region Rules for Historical Data
--region Currency Rule
--
-- Last visit of current residency episodes
SELECT
CensusRound,
MIN(ObservationDate) MinDate,
MAX(ObservationDate) MaxDate
FROM dbo.Observations
GROUP BY CensusRound
ORDER BY CensusRound
--
-- Start of previous round 13 Jul 2009
SELECT
CASE
WHEN EndDate>'20090712' THEN 'Current'
ELSE 'Not Current'
END Currency,
COUNT(*) n
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='VIS'
GROUP BY CASE
WHEN EndDate>'20090712' THEN 'Current'
ELSE 'Not Current'
END
--
-- Currency of Statusobservation, e.g. MaritalStatus
--
WITH YearEnds AS (
SELECT CAST('20001231' AS datetime) YearEnd
UNION
SELECT CAST('20011231' AS datetime) YearEnd
UNION
SELECT CAST('20021231' AS datetime) YearEnd
UNION
SELECT CAST('20031231' AS datetime) YearEnd
UNION
SELECT CAST('20041231' AS datetime) YearEnd
UNION
SELECT CAST('20051231' AS datetime) YearEnd
UNION
SELECT CAST('20061231' AS datetime) YearEnd
UNION
SELECT CAST('20071231' AS datetime) YearEnd
UNION
SELECT CAST('20081231' AS datetime) YearEnd
UNION
SELECT CAST('20091231' AS datetime) YearEnd
),
YearEndIndividuals AS (
SELECT DISTINCT
Individual,YearEnd
FROM dbo.ResidentEpisodes R
CROSS JOIN YearEnds
WHERE R.EndDate>=YearEnd
AND R.StartDate<YearEnd
),
SOCurrency AS (
SELECT
S.Individual,YearEnd,
MIN(DateDiff(day,O.ObservationDate,YearEnd)) Currency
FROM dbo.StatusObservations S
JOIN dbo.Observations O ON (S.Observation=O.Observation)
JOIN YearEndIndividuals I ON (S.Individual=I.Individual)
AND (O.ObservationDate<=I.YearEnd)
GROUP BY S.Individual,YearEnd
)
SELECT
I.YearEnd,
CASE
WHEN C.Currency IS NULL THEN 'Undefined'
WHEN C.Currency>183 THEN 'NotCurrent'
ELSE 'Current'
END Currency,
COUNT(*) n
FROM YearEndIndividuals I
LEFT JOIN SOCurrency C
ON (I.Individual=C.Individual) AND (I.YearEnd=C.YearEnd)
GROUP BY I.YearEnd,CASE
WHEN C.Currency IS NULL THEN 'Undefined'
WHEN C.Currency>183 THEN 'NotCurrent'
ELSE 'Current'
END
ORDER BY I.YearEnd,CASE
WHEN C.Currency IS NULL THEN 'Undefined'
WHEN C.Currency>183 THEN 'NotCurrent'
ELSE 'Current'
END
--endregion
--
--
--region Granularity Rule
WITH NumberedVisits AS (
SELECT
Location, CensusRound, ObservationDate,
ROW_NUMBER()
OVER(PARTITION BY Location, CensusRound
ORDER BY ObservationDate) RowNum,
COUNT(*) OVER(PARTITION BY Location, CensusRound) AS Cnt
FROM dbo.Observations
WHERE CensusRound BETWEEN 1 AND 21
AND ObservationDate BETWEEN '20000101' AND '20091231'
),
MidVisits AS (
SELECT
Location, CensusRound,
CAST(ObservationDate AS float) fDate,
RowNum, Cnt
FROM NumberedVisits
WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
SELECT
Location, CensusRound,
AVG(fDate) mDate
FROM MidVisits
GROUP BY Location, CensusRound
),
MedianVisits AS (
SELECT
Location, CensusRound, CONVERT(datetime,mDate) MedianDate
FROM MedianVisitDate
),
VisitGaps AS (
SELECT
R1.Location,
R1.CensusRound Rn,
R2.CensusRound Rnn,
DATEDIFF(day,R1.MedianDate,R2.MedianDate) Granularity
FROM MedianVisits R1
JOIN MedianVisits R2
ON (R1.Location=R2.Location) AND (R1.CensusRound=R2.CensusRound-1)
)
SELECT
Rn,Rnn,Granularity,COUNT(*) n
FROM VisitGaps
GROUP BY Rn,Rnn,Granularity
ORDER BY Rn,Rnn,Granularity
--
-- Quality indicator based on granularity
-- Gap should be within +-15 days of 183 (twice-yearly rounds);
-- rounds 4, 6, 7 and 8 are also accepted at 122 +-15 days (107 to 137)
--
SELECT
Rnn CensusRound,
CASE
WHEN Rnn IN (4,6,7,8) AND Granularity BETWEEN 107 AND 137 THEN 'InRange'
WHEN Granularity BETWEEN 168 AND 198 THEN 'InRange'
ELSE 'Outside'
END Indicator,
COUNT(*) n
FROM dbo.vLocationVisitGaps
GROUP BY Rnn,
CASE
WHEN Rnn IN (4,6,7,8) AND Granularity BETWEEN 107 AND 137 THEN 'InRange'
WHEN Granularity BETWEEN 168 AND 198 THEN 'InRange'
ELSE 'Outside'
END
ORDER BY Rnn, Indicator
--endregion
--region Continuity Rule
WITH NumberedEpisodes AS (
SELECT
Individual,
StartDate,
InitiatingEventType,
EndDate,
TerminatingEventType,
ROW_NUMBER() OVER(PARTITION BY Individual ORDER BY StartDate) RowNum
FROM dbo.ResidentEpisodes
)
SELECT
YEAR(E2.StartDate) Yr,
CASE
WHEN E2.InitiatingEventType<>'INT' THEN 'InvalidNext'
WHEN ABS(DATEDIFF(day,E1.EndDate,E2.StartDate))>1 THEN 'Discontinuity'
ELSE 'Continuity'
END Continuity,
COUNT(*) n
FROM NumberedEpisodes E1
JOIN NumberedEpisodes E2
ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
WHERE E1.TerminatingEventType='INT'
GROUP BY YEAR(E2.StartDate),
CASE
WHEN E2.InitiatingEventType<>'INT' THEN 'InvalidNext'
WHEN ABS(DATEDIFF(day,E1.EndDate,E2.StartDate))>1 THEN 'Discontinuity'
ELSE 'Continuity'
END
ORDER BY Yr, Continuity
--endregion
--region Timestamp pattern rule
WITH NumberedVisits AS (
SELECT
Location, CensusRound, ObservationDate,
ROW_NUMBER()
OVER(PARTITION BY Location, CensusRound
ORDER BY ObservationDate) RowNum,
COUNT(*) OVER(PARTITION BY Location, CensusRound) AS Cnt
FROM dbo.Observations
WHERE CensusRound BETWEEN 1 AND 21
AND ObservationDate BETWEEN '20000101' AND '20091231'
),
MidVisits AS (
SELECT
Location, CensusRound,
CAST(ObservationDate AS float) fDate,
RowNum, Cnt
FROM NumberedVisits
WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
SELECT
Location, CensusRound,
AVG(fDate) mDate
FROM MidVisits
GROUP BY Location, CensusRound
),
MedianVisits AS (
SELECT
Location, CensusRound, CONVERT(datetime,mDate) MedianDate
FROM MedianVisitDate
),
Semesters AS (
SELECT
1 AS Semester,
CAST('20000101' AS datetime) SemStart,
DATEADD(day,-1,DATEADD(quarter,2,'20000101')) SemEnd
UNION ALL
SELECT
Semester+1 Semester,
DATEADD(day,1,SemEnd) SemStart,
DATEADD(day,-1,DATEADD(quarter,2,DATEADD(day,1,SemEnd))) SemEnd
FROM Semesters
WHERE SemStart<'20090701'
),
SemesterVisits AS (
SELECT
Location,Semester,COUNT(*) n
FROM MedianVisits V
JOIN Semesters ON (MedianDate>=SemStart) AND (MedianDate<=SemEnd)
GROUP BY Location,Semester
)
SELECT
*
FROM SemesterVisits
ORDER BY Location,Semester
--endregion
--endregion
--region Value Pattern Rule
--region Direction of Change
--
-- Example : Educational Attainment
--
WITH EducationStatus AS (
SELECT
Individual,ObservationDate,Years,
ROW_NUMBER()
OVER(PARTITION BY Individual ORDER BY ObservationDate) RowNum
FROM dbo.StatusObservations SO
JOIN dbo.Observations O ON (SO.Observation=O.Observation)
JOIN dbo.EducationLevels E ON (SO.EducationLevel=E.EducationLevel)
WHERE NOT E.Years IS NULL
)
SELECT
CASE
WHEN E2.Years>=E1.Years THEN 'Valid'
ELSE 'Invalid'
END Direction,
COUNT(*) Measures
FROM EducationStatus E1
JOIN EducationStatus E2
ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
GROUP BY CASE
WHEN E2.Years>=E1.Years THEN 'Valid'
ELSE 'Invalid'
END
--
--endregion
--region Magnitude of Change
WITH EducationStatus AS (
SELECT
Individual,ObservationDate,Years,
ROW_NUMBER()
OVER(PARTITION BY Individual ORDER BY ObservationDate) RowNum
FROM dbo.StatusObservations SO
JOIN dbo.Observations O ON (SO.Observation=O.Observation)
JOIN dbo.EducationLevels E ON (SO.EducationLevel=E.EducationLevel)
WHERE NOT E.Years IS NULL
)
SELECT
CASE
WHEN E2.Years<E1.Years THEN 'Invalid Direction'
WHEN (E2.Years-E1.Years)>DATEDIFF(year,E1.ObservationDate,E2.ObservationDate)
THEN 'Invalid Magnitude'
ELSE 'Valid'
END Direction,
COUNT(*) Measures
FROM EducationStatus E1
JOIN EducationStatus E2
ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
GROUP BY CASE
WHEN E2.Years<E1.Years THEN 'Invalid Direction'
WHEN (E2.Years-E1.Years)>DATEDIFF(year,E1.ObservationDate,E2.ObservationDate)
THEN 'Invalid Magnitude'
ELSE 'Valid'
END
--endregion
--endregion
--region Event History Rule
--region Event Dependencies
--
-- Out-migration must not be directly preceded by a death, visit or out-migration event
--
WITH Events AS (
SELECT
Individual,
InitiatingEventType Event,
StartDate EventDate,
ResidentEpisode
FROM dbo.ResidentEpisodes
WHERE StartDate<>EndDate
UNION ALL
SELECT
Individual,
TerminatingEventType Event,
EndDate EventDate,
ResidentEpisode
FROM dbo.ResidentEpisodes
WHERE StartDate<>EndDate
),
NumberedEvents AS (
SELECT
Individual,
Event,EventDate,
ROW_NUMBER()
OVER(PARTITION BY Individual ORDER BY EventDate, ResidentEpisode) RowNum
FROM Events
)
SELECT
CASE
WHEN E1.Event IN ('DTH','OTM','VIS') THEN 'Incorrect'
ELSE 'Correct'
END Dependency,
--E1.Event,
COUNT(*) n
FROM NumberedEvents E1
JOIN NumberedEvents E2
ON (E1.Individual=E2.Individual) AND (E1.RowNum=E2.RowNum-1)
WHERE E2.Event='OTM'
GROUP BY --E1.Event
CASE
WHEN E1.Event IN ('DTH','OTM','VIS') THEN 'Incorrect'
ELSE 'Correct'
END
--endregion
--region Event Conditions
--
-- Deliveries with live births should be at least 9 months (280 days) apart
--
WITH NumberedPregnancies AS (
SELECT
Individual,EndDate DeliveryDate,
ROW_NUMBER()
OVER(PARTITION BY Individual ORDER BY EndDate) RowNum
FROM dbo.Pregnancies
WHERE LiveBorn>0
)
SELECT
CASE
WHEN DATEDIFF(day,P1.DeliveryDate,P2.DeliveryDate)<280 THEN 'TooShort'
ELSE 'Valid'
END BirthSpacing,
COUNT(*) Pregnancies
FROM NumberedPregnancies P1
JOIN NumberedPregnancies P2
ON (P1.Individual=P2.Individual) AND (P1.RowNum=P2.RowNum-1)
GROUP BY CASE
WHEN DATEDIFF(day,P1.DeliveryDate,P2.DeliveryDate)<280 THEN 'TooShort'
ELSE 'Valid'
END
--endregion
--region Event-specific attribute constraints
--
-- Mother's age at the end of a pregnancy should be between 15 and 49 years
--
SELECT
CASE
WHEN dbo.fnacAgeYears(I.DoB,P.EndDate) BETWEEN 15 AND 49 THEN 'Valid'
ELSE 'Invalid'
END MotherAge,
COUNT(*) Pregnancies
FROM dbo.Pregnancies P
JOIN dbo.Individuals I ON (P.Individual=I.Individual)
GROUP BY CASE
WHEN dbo.fnacAgeYears(I.DoB,P.EndDate) BETWEEN 15 AND 49 THEN 'Valid'
ELSE 'Invalid'
END
--endregion
--endregion
--endregion
--
--region Rules for state-dependent objects
--region State domain constraint
--endregion
--region Action domain constraint
--endregion
--region Terminator domain constraint
SELECT
ToState,Action,COUNT(*) n
FROM dbo.udfStateTransitions('20000101')
WHERE Transition=1
GROUP BY ToState,Action
ORDER BY ToState,Action
--endregion
--region State-transition constraints
--
-- Individuals with invalid end states
--
WITH LastTransition AS (
SELECT
Individual,MAX(Transition) LastTransition
FROM dbo.udfStateTransitions('20000101')
GROUP BY Individual
)
SELECT
CASE
WHEN ToState='INV' THEN 'Invalid'
ELSE 'Valid'
END Quality,
COUNT(*) Individuals
FROM dbo.udfStateTransitions('20000101') T
JOIN LastTransition LT
ON (T.Individual=LT.Individual) AND (T.Transition=LT.LastTransition)
GROUP BY
CASE
WHEN ToState='INV' THEN 'Invalid'
ELSE 'Valid'
END
--
-- Invalid transitions
--
SELECT
CASE
WHEN ToState='INV' THEN 'Invalid'
ELSE 'Valid'
END Quality,
COUNT(*) Transitions
FROM dbo.udfStateTransitions('20000101') T
GROUP BY
CASE
WHEN ToState='INV' THEN 'Invalid'
ELSE 'Valid'
END
--
-- Breakdown of invalid transitions
--
SELECT
InvalidReason,Action,
COUNT(*) n
FROM dbo.udfStateTransitions('20000101') T
WHERE ToState='INV'
GROUP BY InvalidReason, Action
ORDER BY InvalidReason, Action
--
-- Breakdown by surveillance round
--
SELECT
O.CensusRound,
SUM(CASE WHEN ToState='INV' THEN 1 ELSE 0 END) Invalid,
SUM(CASE WHEN ToState='INV' THEN 0 ELSE 1 END) Valid,
COUNT(*) Transitions
FROM dbo.udfStateTransitions('20000101') T
JOIN dbo.Observations O ON (T.Observation=O.Observation)
GROUP BY O.CensusRound
ORDER BY O.CensusRound
--endregion
--region State-action constraints
--endregion
--region Continuity rules
--endregion
--region Duration rules
--
-- Residency episode cannot be of zero or negative duration
--
SELECT
YEAR(StartDate) Yr,
CASE
WHEN DATEDIFF(day,StartDate,EndDate)>0 THEN 'Valid'
ELSE 'Invalid'
END Duration,
COUNT(*) Episodes
FROM dbo.ResidentEpisodes
GROUP BY YEAR(StartDate),
CASE
WHEN DATEDIFF(day,StartDate,EndDate)>0 THEN 'Valid'
ELSE 'Invalid'
END
ORDER BY Yr,Duration
--endregion
--region Action pre-conditions
WITH ResidentBabies AS (
SELECT
Individual Baby,StartDate
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DLV'
),
ResidentBabyMothers AS (
SELECT DISTINCT
B.Baby,
I.MotherID Mother
FROM dbo.Individuals I
JOIN ResidentBabies B ON (I.Individual=B.Baby)
WHERE NOT MotherID IS NULL
)
SELECT
CASE
WHEN BM.Baby IS NULL THEN 'Mother unknown'
WHEN RE.ResidentEpisode IS NULL THEN 'Mother non-resident'
ELSE 'Mother resident'
END Mothers,
COUNT(*) Babies
FROM ResidentBabies B
LEFT JOIN ResidentBabyMothers BM ON (B.Baby=BM.Baby)
LEFT JOIN dbo.ResidentEpisodes RE
ON (BM.Mother=RE.Individual)
AND (RE.StartDate<=B.StartDate) AND (RE.EndDate>=B.StartDate)
GROUP BY
CASE
WHEN BM.Baby IS NULL THEN 'Mother unknown'
WHEN RE.ResidentEpisode IS NULL THEN 'Mother non-resident'
ELSE 'Mother resident'
END
--endregion
--region Action post-conditions
--endregion
--endregion
--
--region General attribute dependency rules
--region Redundant attributes
--
-- Are all cases where residency is started by birth
-- which is linked to a pregnancy and then to the mother,
-- also reflected in the MotherID link of the child?
--
WITH DirectMCLink AS ( --76898 pairs
SELECT
MotherID,
Individual ChildID
FROM dbo.Individuals
WHERE NOT MotherID IS NULL
),
IndirectMCLink AS ( --15747
SELECT DISTINCT
P.Individual MotherID,
R.Individual ChildID
FROM dbo.Pregnancies P
JOIN dbo.Births B ON (P.Pregnancy=B.Pregnancy)
JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
)
SELECT
CASE
WHEN D.MotherID IS NULL THEN 'Not linked'
ELSE 'Linked'
END Link,
COUNT(*) Pairs
FROM IndirectMCLink I
LEFT JOIN DirectMCLink D
ON (I.MotherID=D.MotherID) AND (I.ChildID=D.ChildID)
GROUP BY
CASE
WHEN D.MotherID IS NULL THEN 'Not linked'
ELSE 'Linked'
END
--
-- Of the children born to the mother while she was resident,
-- are all such children recorded as resident
-- and the residency start marked as Birth?
WITH MotherBirths AS ( --21907
SELECT
MotherID,
Individual ChildID,
DoB
FROM dbo.Individuals
WHERE NOT MotherID IS NULL
AND DoB>='20000101' -- After start of DSS
),
BirthsDuringResidency AS ( --15798
SELECT
B.*
FROM MotherBirths B
JOIN dbo.ResidentEpisodes R
ON (R.Individual=B.MotherID)
AND (B.DoB>=R.StartDate)
AND (B.DoB<=R.EndDate)
),
ResidenciesFromBirth AS ( --16430
SELECT
Individual ChildID
FROM dbo.Births B
JOIN dbo.ResidentEpisodes R ON (B.ResidentEpisode=R.ResidentEpisode)
)
SELECT
CASE
WHEN A.ChildID IS NULL THEN 'Birth not linked'
WHEN B.ChildID IS NULL THEN 'Birth not resident'
ELSE 'Consistent'
END Link,
COUNT(*) Pairs
FROM BirthsDuringResidency A
FULL JOIN ResidenciesFromBirth B ON (A.ChildID=B.ChildID)
GROUP BY
CASE
WHEN A.ChildID IS NULL THEN 'Birth not linked'
WHEN B.ChildID IS NULL THEN 'Birth not resident'
ELSE 'Consistent'
END
ORDER BY Link
--endregion
--region Derived Attributes
--
-- Data should satisfy the demographic equation
--
-- Resident Population at start of year
SELECT
'Population' AS Component,
SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0
END) Y2000,
SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0
END) Y2001,
SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0
END) Y2002,
SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0
END) Y2003,
SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0
END) Y2004,
SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0
END) Y2005,
SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0
END) Y2006,
SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0
END) Y2007,
SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0
END) Y2008,
SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0
END) Y2009
FROM dbo.ResidentEpisodes
UNION
SELECT
'Start observation' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT
'Births' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT
'Immigration' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT
'Deaths' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT
'Emigration' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM dbo.ResidentEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS'
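--
-- Sketch of a balance check for one year (not part of the workshop script).
-- If the data satisfy the demographic equation, the population at the start
-- of 2000 plus entries during 2000 minus exits during 2000 should equal the
-- population at the start of 2001.
-- Assumptions: the event codes are those used elsewhere in this script;
-- 'VIS' is counted as an exit (loss to follow-up); internal moves ('INT')
-- are left out on both sides, which nets out only when the end of one
-- episode and the start of the next fall in the same year. The column names
-- PopStart, Entries, Exits, PopEnd and Residual are illustrative only.
--
WITH Components AS (
SELECT
SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0 END) PopStart,
SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0 END) PopEnd,
SUM(CASE WHEN YEAR(StartDate)=2000
AND InitiatingEventType IN ('DSS','DLV','INM','HMS') THEN 1 ELSE 0 END) Entries,
SUM(CASE WHEN YEAR(EndDate)=2000
AND TerminatingEventType IN ('DTH','OTM','HDS','VIS') THEN 1 ELSE 0 END) Exits
FROM dbo.ResidentEpisodes
)
SELECT
PopStart,Entries,Exits,PopEnd,
PopEnd-(PopStart+Entries-Exits) Residual -- 0 when the year balances
FROM Components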
--
-- Taking into account contextual factors, such as change in DSS boundary
--
WITH CensoredEpisodes AS (
SELECT
CASE
WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN 'DSS'
ELSE R.InitiatingEventType
END InitiatingEventType,
CASE
WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN '20061001'
ELSE R.StartDate
END StartDate,
R.EndDate,
R.TerminatingEventType
FROM dbo.ResidentEpisodes R
JOIN dbo.Locations L ON (R.Location=L.Location)
)
SELECT
'Population' AS Component,
SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0
END) Y2000,
SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0
END) Y2001,
SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0
END) Y2002,
SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0
END) Y2003,
SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0
END) Y2004,
SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0
END) Y2005,
SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0
END) Y2006,
SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0
END) Y2007,
SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0
END) Y2008,
SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0
END) Y2009
FROM CensoredEpisodes
UNION
SELECT
'Start observation' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT
'Births' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT
'Immigration' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT
'Deaths' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT
'Emigration' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS'
--
-- Taking into account contextual factors and loss to follow-up
--
WITH CensoredEpisodes AS (
SELECT
CASE
WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN 'DSS'
ELSE R.InitiatingEventType
END InitiatingEventType,
CASE
WHEN L.Indlovu=1 AND R.StartDate<'20061001' THEN '20061001'
ELSE R.StartDate
END StartDate,
R.EndDate,
R.TerminatingEventType
FROM dbo.ResidentEpisodes R
JOIN dbo.Locations L ON (R.Location=L.Location)
)
SELECT
'Population' AS Component,
SUM(CASE WHEN StartDate<'20000101' AND EndDate>='20000101' THEN 1 ELSE 0
END) Y2000,
SUM(CASE WHEN StartDate<'20010101' AND EndDate>='20010101' THEN 1 ELSE 0
END) Y2001,
SUM(CASE WHEN StartDate<'20020101' AND EndDate>='20020101' THEN 1 ELSE 0
END) Y2002,
SUM(CASE WHEN StartDate<'20030101' AND EndDate>='20030101' THEN 1 ELSE 0
END) Y2003,
SUM(CASE WHEN StartDate<'20040101' AND EndDate>='20040101' THEN 1 ELSE 0
END) Y2004,
SUM(CASE WHEN StartDate<'20050101' AND EndDate>='20050101' THEN 1 ELSE 0
END) Y2005,
SUM(CASE WHEN StartDate<'20060101' AND EndDate>='20060101' THEN 1 ELSE 0
END) Y2006,
SUM(CASE WHEN StartDate<'20070101' AND EndDate>='20070101' THEN 1 ELSE 0
END) Y2007,
SUM(CASE WHEN StartDate<'20080101' AND EndDate>='20080101' THEN 1 ELSE 0
END) Y2008,
SUM(CASE WHEN StartDate<'20090101' AND EndDate>='20090101' THEN 1 ELSE 0
END) Y2009
FROM CensoredEpisodes
UNION
SELECT
'Start observation' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DSS'
UNION
SELECT
'Births' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='DLV'
UNION
SELECT
'Immigration' AS Component,
SUM(CASE WHEN YEAR(StartDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(StartDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(StartDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(StartDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(StartDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(StartDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(StartDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(StartDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(StartDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(StartDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE InitiatingEventType='INM' OR InitiatingEventType='HMS'
UNION
SELECT
'Deaths' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='DTH'
UNION
SELECT
'Emigration' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='OTM' OR TerminatingEventType='HDS'
UNION
SELECT
'Loss to Follow-up' AS Component,
SUM(CASE WHEN YEAR(EndDate)=2000 THEN 1 ELSE 0 END) Y2000,
SUM(CASE WHEN YEAR(EndDate)=2001 THEN 1 ELSE 0 END) Y2001,
SUM(CASE WHEN YEAR(EndDate)=2002 THEN 1 ELSE 0 END) Y2002,
SUM(CASE WHEN YEAR(EndDate)=2003 THEN 1 ELSE 0 END) Y2003,
SUM(CASE WHEN YEAR(EndDate)=2004 THEN 1 ELSE 0 END) Y2004,
SUM(CASE WHEN YEAR(EndDate)=2005 THEN 1 ELSE 0 END) Y2005,
SUM(CASE WHEN YEAR(EndDate)=2006 THEN 1 ELSE 0 END) Y2006,
SUM(CASE WHEN YEAR(EndDate)=2007 THEN 1 ELSE 0 END) Y2007,
SUM(CASE WHEN YEAR(EndDate)=2008 THEN 1 ELSE 0 END) Y2008,
SUM(CASE WHEN YEAR(EndDate)=2009 THEN 1 ELSE 0 END) Y2009
FROM CensoredEpisodes
WHERE TerminatingEventType='VIS'
--
-- Find 705 people present in 2001 in excess of expectations
--
WITH PresentIn2001 AS (
SELECT DISTINCT
Individual
FROM dbo.ResidentEpisodes
WHERE StartDate<'20010101' AND EndDate>='20010101'
),
CameIn2000 AS (
SELECT DISTINCT
Individual
FROM dbo.ResidentEpisodes
WHERE YEAR(StartDate)=2000
AND InitiatingEventType IN ('DSS','DLV','INM')
),
LeftIn2000 AS (
SELECT DISTINCT
Individual
FROM dbo.ResidentEpisodes
WHERE YEAR(EndDate)=2000
AND TerminatingEventType IN ('DTH','VIS','OTM')
)
SELECT
A.Individual
FROM PresentIn2001 A
JOIN LeftIn2000 B ON (A.Individual=B.Individual)
--
SELECT
*
FROM dbo.ResidentEpisodes
WHERE Individual=56179
--endregion
--region Partially Dependent Attributes
SELECT
D.DeathCause, C.Description,
COUNT(*) n
FROM dbo.Deaths D
JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
GROUP BY D.DeathCause,C.Description
ORDER BY n DESC
--
SELECT
I.Sex,
COUNT(*) n
FROM dbo.Individuals I
JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
JOIN dbo.Deaths D ON (R.ResidentEpisode=D.ResidentEpisode)
WHERE DeathCause IN
('C53','C50','C55','O72','O85','O15','O14','O75','C56','C57')
GROUP BY I.Sex
--
SELECT
I.Sex,C.Description
FROM dbo.Individuals I
JOIN dbo.ResidentEpisodes R ON (I.Individual=R.Individual)
JOIN dbo.Deaths D ON (R.ResidentEpisode=D.ResidentEpisode)
JOIN dbo.ICD10codes C ON (D.DeathCause=C.Code)
WHERE DeathCause IN
('C53','C50','C55','O72','O85','O15','O14','O75','C56','C57')
AND I.Sex='MAL'
--endregion
--endregion
AS BEGIN
DECLARE @Individual int
DECLARE @DoB datetime
DECLARE @InitiatingEventType char(3)
DECLARE @StartDate datetime
DECLARE @TerminatingEventType char(3)
DECLARE @EndDate datetime
DECLARE @Transition int
DECLARE @NextState char(3)
DECLARE @CurrentState char(3)
DECLARE @LastEvent char(3)
DECLARE @LastIndividual int
DECLARE @LastDate datetime
DECLARE @FirstObservation int
DECLARE @LastObservation int
OPEN C;
SET @LastIndividual=-1;
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'Action disallowed if not under surveillance', @FirstObservation);
END
ELSE IF (@InitiatingEventType IN ('DSS','INM','DLV')) BEGIN --Invalid action condition
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'Action condition violated', @FirstObservation);
END
ELSE IF (@LastDate>@StartDate) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'Temporal integrity violated', @FirstObservation);
END
ELSE BEGIN --Invalid event
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'Invalid action', @FirstObservation);
END;
END;
IF (@CurrentState='SLK') BEGIN
IF (@InitiatingEventType IN ('VIS','OTM','DTH')) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'Action cannot start a residency', @FirstObservation);
END
ELSE IF (@InitiatingEventType IN ('INT','INM','DLV','DSS')) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'Action cannot start a residency if already at known location',
@FirstObservation);
END
ELSE BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'Invalid action', @FirstObservation);
END
END;
IF (@CurrentState='SLU') BEGIN
IF (@InitiatingEventType='INT' AND @LastDate<=@StartDate) BEGIN
SET @NextState='SLK';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate, @FirstObservation);
END
ELSE IF (@InitiatingEventType IN ('VIS','OTM','DTH')) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'Action cannot start a residency', @FirstObservation);
END
ELSE IF (@InitiatingEventType IN ('INM','DLV','DSS')) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'Action cannot start a residency if at unknown location',
@FirstObservation);
END
ELSE IF (@LastDate>@StartDate) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'Temporal integrity violated', @FirstObservation);
END
ELSE BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'Invalid action', @FirstObservation);
END
END;
IF (@CurrentState='DTH') BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@InitiatingEventType,
@StartDate,'No transitions after terminating state', @FirstObservation);
END;
SET @LastDate=@StartDate;
SET @CurrentState=@NextState;
SET @Transition=@Transition+1;
-- Do end event transition
IF (@CurrentState='NUS') BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType
,@EndDate,'Cannot be not under surveillance before residency end',
@LastObservation);
END;
IF (@CurrentState='SLK') BEGIN
IF (@TerminatingEventType='INT' AND @LastDate<@EndDate) BEGIN
SET @NextState='SLU';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,
@EndDate, @LastObservation);
END
ELSE IF (@TerminatingEventType='OTM' AND @LastDate<@EndDate) BEGIN
SET @NextState='NUS';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,
@EndDate, @LastObservation);
END
ELSE IF (@TerminatingEventType='VIS' AND @LastDate<=@EndDate) BEGIN
SET @NextState='SLK';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,
@EndDate, @LastObservation);
END
ELSE IF (@TerminatingEventType='DTH' AND @LastDate<=@EndDate) BEGIN
SET @NextState='DTH';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate, Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,
@EndDate, @LastObservation);
END
ELSE IF (@TerminatingEventType IN ('INM','DSS','DLV')) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,
@EndDate,'Action cannot end a residency', @LastObservation);
END
ELSE IF (@LastDate>=@EndDate) BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,
@EndDate,'Temporal integrity violated', @LastObservation);
END
ELSE BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType,
@EndDate,'Invalid action', @LastObservation);
END
END;
IF (@CurrentState='SLU') BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType
,@EndDate,'Cannot be at unknown location before residency end', @LastObservation);
END;
IF (@CurrentState='DTH') BEGIN
SET @NextState='INV';
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,
Observation)
VALUES(@Individual,@Transition,@CurrentState,@NextState,@TerminatingEventType
,@EndDate,'Cannot be dead before residency end', @LastObservation);
END;
SET @LastDate=@EndDate;
SET @CurrentState=@NextState;
FETCH C INTO @Individual, @DoB,
@InitiatingEventType, @StartDate, @FirstObservation,
@TerminatingEventType, @EndDate, @LastObservation
END;
CLOSE C;
DEALLOCATE C;
RETURN
END
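
The function above writes one row per transition and marks every rule violation with ToState='INV' and an InvalidReason. A minimal sketch of how those rows might be tallied into a per-reason count follows. The table variable stands in for the function's result, and its column types and sample rows are assumptions made for illustration (the multi-row VALUES clause needs SQL Server 2008 or later).

```sql
-- Hypothetical stand-in for the function's result table; column types are assumed.
DECLARE @Transitions TABLE (
    Individual int,
    Transition int,
    FromState char(3),
    ToState char(3),
    Action char(3),
    TransitionDate datetime,
    InvalidReason varchar(100) NULL,
    Observation int
);

-- Made-up sample rows, for illustration only.
INSERT INTO @Transitions
(Individual,Transition,FromState,ToState,Action,TransitionDate,InvalidReason,Observation)
VALUES
(1,1,'SLU','SLK','INT','20050101',NULL,10),
(1,2,'SLK','NUS','OTM','20060301',NULL,14),
(2,1,'SLU','INV','VIS','20050601','Action cannot start a residency',11),
(3,1,'SLK','INV','OTM','20050101','Temporal integrity violated',12);

-- Count invalid transitions by reason, most frequent first.
SELECT InvalidReason, COUNT(*) AS InvalidTransitions
FROM @Transitions
WHERE ToState='INV'
GROUP BY InvalidReason
ORDER BY InvalidTransitions DESC, InvalidReason;
```

Dividing each count by the total number of rows would give a proportion that can be compared between sites.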
CREATE FUNCTION dbo.udfSeekDuplicates ()
RETURNS @Duplicates TABLE (
IndA int,
IndB int,
Similarity int
)
AS BEGIN
DECLARE @Individual int
DECLARE @LastName varchar(50)
DECLARE @FirstName varchar(50)
DECLARE @Sex char(3)
DECLARE @DoB datetime
OPEN C;
RETURN
END
WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
),
MedianVisitDate AS (
SELECT
Location, CensusRound,
AVG(fDate) mDate
FROM MidVisits
GROUP BY Location, CensusRound
),
MedianVisits AS (
SELECT
Location, CensusRound, CONVERT(datetime,mDate) MedianDate
FROM MedianVisitDate
)
SELECT
R1.Location,
R1.CensusRound Rn,
R2.CensusRound Rnn,
DATEDIFF(day,R1.MedianDate,R2.MedianDate) Granularity
FROM MedianVisits R1
JOIN MedianVisits R2
ON (R1.Location=R2.Location) AND (R1.CensusRound=R2.CensusRound-1)
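
The filter RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2) in the query above keeps the single middle row when a group has an odd number of rows and the two middle rows when it has an even number, so that AVG then yields the median. The sketch below shows one way the ranking step feeding that filter could look. The @Visits table, its columns, the Ranked expression and the sample dates are assumptions for illustration and are not taken from the original listing.

```sql
-- Hypothetical visit table with made-up sample dates.
DECLARE @Visits TABLE (Location int, CensusRound int, VisitDate datetime);

INSERT INTO @Visits (Location, CensusRound, VisitDate)
VALUES
(1,1,'20100101'),
(1,1,'20100103'),
(1,1,'20100105'),
(1,1,'20100111'),
(1,2,'20100501'),
(1,2,'20100507'),
(1,2,'20100509');

WITH Ranked AS (
    -- Number the visits within each location and round, and carry the group size.
    SELECT
        Location, CensusRound,
        CONVERT(float, VisitDate) AS fDate,
        ROW_NUMBER() OVER (PARTITION BY Location, CensusRound ORDER BY VisitDate) AS RowNum,
        COUNT(*) OVER (PARTITION BY Location, CensusRound) AS Cnt
    FROM @Visits
),
MidVisits AS (
    -- Integer division picks one middle row (odd Cnt) or two (even Cnt).
    SELECT Location, CensusRound, fDate
    FROM Ranked
    WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
)
SELECT
    Location, CensusRound,
    CONVERT(datetime, AVG(fDate)) AS MedianDate
FROM MidVisits
GROUP BY Location, CensusRound;
```

Converting the dates to float before averaging is needed because AVG does not accept datetime values directly.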