Data Warehousing and Mining

(PDAM)
Submission 1

NHS Data warehouse system


Design & Implementation
Table of Contents
Description
Data warehouse schemas
Schema 1 – Fact table [Treatment Cost]
Schema 2 – Fact table [Drug Costs]
Queries and outputs
Query no 1
Query no 2
Query no 3
Query no 4
Query no 5
Query no 6
Bibliography
Appendices

PDAM [Submission 1] Page 1 of 18


NHS Data Warehouse System
Description
A star schema was preferred over a snowflake schema because snowflaking is usually not recommended in a
dimensional modelling environment. A data warehouse based on a snowflake schema may suffer a performance decrease
(more tables need to be joined to answer a query), and the dimensional model may be more difficult to understand.
Although a snowflake schema may reduce disk space, the query performance impact outweighs the space saving
(Rad, 2014). Typically, normalising the parent table (snowflaking) is required only when the parent dimension is
populated from different sources, or when there are two (or more) sets of attributes that define the information at
different levels (grains) (Nguyen, Tjoa, & Trujillo, 2005).
To preserve the natural source system IDs, as suggested by IBM (Ballard, Farrell, Gupta, Mazuela, & Vohnik,
2006), and to avoid ID conflicts between different sources, a surrogate key was used for all dimensions. For
example, a clinic may hold staff IDs as INT (e.g. 123), while a hospital may hold staff IDs as VARCHAR (e.g. SID-123).
Using unique surrogate keys for the dimensions therefore preserves the original ID, and the DWH is not affected if
a source changes the structure of its IDs. Surrogate keys also ensure that each record is unique in the
dimension table, because there is no guarantee that the natural PKs are unique across sources.
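
The surrogate key idea can be illustrated with a minimal Python sketch. The class name `SurrogateKeyMap`, the source names and the sample IDs are hypothetical and are not part of the implemented schema; the sketch only shows how natural IDs of different types can be kept unchanged while the warehouse assigns its own integer key.

```python
from itertools import count


class SurrogateKeyMap:
    """Assigns one warehouse-side integer key per (source, natural ID) pair."""

    def __init__(self):
        self._next_key = count(1)  # surrogate keys start at 1
        self._keys = {}            # (source, natural ID as text) -> surrogate key

    def key_for(self, source, natural_id):
        # The natural ID is kept as text, so INT and VARCHAR style IDs coexist.
        natural = (source, str(natural_id))
        if natural not in self._keys:
            self._keys[natural] = next(self._next_key)
        return self._keys[natural]


staff_keys = SurrogateKeyMap()
clinic_key = staff_keys.key_for("clinic", 123)            # INT style ID
hospital_key = staff_keys.key_for("hospital", "SID-123")  # VARCHAR style ID
clash_key = staff_keys.key_for("hospital", 123)           # same number, other source

print(clinic_key, hospital_key, clash_key)
```

Asking for the same (source, natural ID) pair again returns the key that was already assigned, so reloading a source does not create duplicate dimension rows.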
For both schemas, date records were generated automatically using a simple SQL INSERT ... SELECT
(Figure 1) to populate the date dimension table (Figure 2) with all the dates of the last 10 years (01/01/2009 –
31/12/2018). The date dimension has the attributes full date, timestamp, day of the week, month, day, year and week
number of the year. If the business model requires more attributes (e.g. weekends, holidays or financial years), these
can be generated and implemented in the same way.
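
As a rough cross-check of the generated range, the same rows can be sketched in Python. This is not the SQL from Figure 1: the function name, the dictionary keys and the use of the ISO week number are assumptions made for the illustration, and the timestamp attribute is left out.

```python
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def date_dimension_rows(start, end):
    """Yield one dict per calendar day from start to end (inclusive)."""
    day = start
    key = 1
    while day <= end:
        yield {
            "date_key": key,                        # surrogate key
            "full_date": day.isoformat(),
            "day_of_week": DAY_NAMES[day.weekday()],
            "day": day.day,
            "month": day.month,
            "year": day.year,
            "week_of_year": day.isocalendar()[1],   # ISO week (assumption)
        }
        day += timedelta(days=1)
        key += 1


rows = list(date_dimension_rows(date(2009, 1, 1), date(2018, 12, 31)))
print(len(rows))  # 3652 rows: ten years including the leap years 2012 and 2016
print(rows[0])
```

Extra attributes such as a weekend flag could be added as further dictionary entries in the same loop.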

Figure 1 – SQL to automatically populate date dimension table with last 10 years

A time attribute (H:M:S) was not considered for this table because, for one year alone (365 days/rows), adding
time at one-second granularity would increase the number of rows to 31,536,000 (365 days x 24 hours x 60 min x
60 sec). However, if a time attribute is a mandatory business requirement, it can be handled by adding a time
dimension, separate from the date dimension.
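
The row counts behind this decision can be checked with a few lines of Python. The one-row-per-second granularity of the separate time dimension is an assumption made for the comparison.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 seconds in a day

rows_date_only = 365                      # one row per day for one year
rows_combined = rows_date_only * SECONDS_PER_DAY   # date and time in one table
rows_separate = rows_date_only + SECONDS_PER_DAY   # date table plus time table

print(rows_combined)   # 31536000
print(rows_separate)   # 86765
```

With two separate dimensions the fact table carries both a date key and a time key, and the row counts add up instead of multiplying.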

Figure 2 - Date dimension output of INSERT

For the second schema (Figure 8), the fact subject is the total number of drugs and their cost. The fact table
(Figure 3) is structured based on the following reasoning:

Date (date_key 1~n); location (location_ 1~n); drug (drug_key 1~n, where drug price and details are stored);
prescription (prescription_key 1~n), where a prescription can contain multiple drugs in different quantities. For
example, prescription_1 belongs only to patient_1, for whom drugID_3 was prescribed twice and drugID_1 once
(total_no_drug), with total prices of £200 (drugID_3 x 2) and £50 (drugID_1 x 1). Figure 6 shows that
drugID_1 price = 50, drugID_2 price = 75 and drugID_3 price = 100. In this way every record is unique.
Figure 5 shows the output from a larger data sample, where drug_cost multiplied by total_no_drug is correctly
calculated in the total_cost_drug column (e.g. £50 * 3 = £150).
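
The prescription example can be replayed in a short Python sketch. The prices are the ones quoted above; the list and dictionary layout is only an assumed stand-in for the fact and drug tables.

```python
from decimal import Decimal

# Drug prices as quoted in the text (drugID -> price in pounds)
drug_price = {1: Decimal("50"), 2: Decimal("75"), 3: Decimal("100")}

# prescription_1 for patient_1: drugID_3 twice, drugID_1 once
prescription_lines = [
    {"prescription_key": 1, "patient_key": 1, "drug_key": 3, "total_no_drug": 2},
    {"prescription_key": 1, "patient_key": 1, "drug_key": 1, "total_no_drug": 1},
]

# One fact row per (prescription, drug); the cost is price times quantity
fact_rows = [
    dict(line, total_cost_drug=drug_price[line["drug_key"]] * line["total_no_drug"])
    for line in prescription_lines
]

for row in fact_rows:
    print(row["drug_key"], row["total_no_drug"], row["total_cost_drug"])
```

Each (prescription_key, drug_key) pair appears once, which is what keeps every fact record unique.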

Figure 3 - Schema 2 fact table

Figure 4 - Fact table connection logic

Figure 6 - Schema 2 drug table

Figure 5 - Larger sample output for second schema

Data warehouse schemas
Schema 1 – Fact table [Treatment Cost]

Figure 7 - Star schema (Treatments)

Schema 2 – Fact table [Drug Costs]

For this schema, the patient type was moved into the prescription dimension, because each prescription is
unique to one patient. A patient can be IN on day 1 and OUT on day 2 (or on any other prescription on the same day).

Figure 8 - Star Schema (Drugs)


Queries and outputs
Query no 1
This query lists the cities of a specific county separately, broken down by year in descending order, with
the total number of treatments (per city/year) and the total costs (per city/year). At the same time, it
calculates the total number of treatments and the total costs per county/year. The last row is the county's
total number of treatments and total costs (over 10 years). The query was formatted to hide NULL values and to add
a £ sign to the output cost. For the editable text of the query, please see appendix A.
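
The shape of a rolled-up result (detail rows, a subtotal per year and a grand total) can be sketched in plain Python. The sample rows and the function name `rollup` are hypothetical; the sketch imitates the output layout and is not how MySQL implements WITH ROLLUP.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, city, treatment cost)
facts = [
    (2018, "Portsmouth", 100),
    (2018, "Southampton", 300),
    (2017, "Portsmouth", 500),
]


def rollup(rows):
    """Return detail, per-year subtotal and grand-total rows, ROLLUP style."""
    per_city = defaultdict(lambda: [0, 0])  # (year, city) -> [count, cost]
    per_year = defaultdict(lambda: [0, 0])  # year -> [count, cost]
    grand = [0, 0]
    for year, city, cost in rows:
        for bucket in (per_city[(year, city)], per_year[year], grand):
            bucket[0] += 1
            bucket[1] += cost
    out = []
    for year in sorted(per_year, reverse=True):        # years descending
        for (y, city), (n, total) in sorted(per_city.items()):
            if y == year:
                out.append((year, city, n, total))
        n, total = per_year[year]
        out.append((year, None, n, total))             # None city = subtotal
    out.append((None, None, grand[0], grand[1]))       # None year = grand total
    return out


result = rollup(facts)
for year, city, n, total in result:
    # Same labelling idea as COALESCE(city, 'TOTAL') in the query
    print(year if year is not None else "ALL", city or "TOTAL", n, f"£ {total}")
```

The `None` placeholders play the role of the NULL values that the query replaces with a label.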

Figure 9 - Query 1
In Figure 11 we can observe the output from a small sample data set with 2 counties, 2 cities and 1
treatment per year, with few records so that the output calculations can be verified.
Figure 12 shows the same query run on a sample of 3 counties, 10 cities and tens of treatments per year,
with over 10,000 records. In the sample I used one unit per city, but because the unit_key is unique, implementing
more than one unit per city will not affect the result (Figure 10).

Figure 10 - Unit location data sample

NOTE: For a better display of the results, I would use the GROUPING function in MySQL instead of COALESCE,
but the GROUPING function is supported only from MySQL 8.0 onwards (MySQL.com, 2018). My testing environment
was set up with version 5.6.

Figure 11 - Hampshire County over 10 years with 2 cities and 1 treatment per year

Query no 2
The second query (Figure 13) returns, for the last 5 years from the current date, all male outpatients treated
in Portsmouth clinics. It calculates the number of treatments per year, along with the total cost for the year and
the average cost per treatment, rounded to 2 decimals. The query uses ROLLUP to calculate the total number of
treatments, costs and average treatment cost over the 5 years, sorted in descending order (Figure 14).
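
The year filter in this query compares whole years, `dim_date.year >= YEAR(DATE_SUB(CURDATE(), INTERVAL 5 year))`. The following Python sketch, with a hypothetical helper name and an arbitrary example date, shows which calendar years such a cutoff selects. Whether the cutoff year itself should count as part of the "last 5 years" depends on how the requirement is read.

```python
from datetime import date


def years_in_window(today, span=5):
    """Calendar years selected by a 'year >= current year - span' filter."""
    cutoff = today.year - span   # same value as YEAR(today - span years)
    return list(range(cutoff, today.year + 1))


window = years_in_window(date(2018, 12, 7))
print(window)        # [2013, 2014, 2015, 2016, 2017, 2018]
print(len(window))   # 6 calendar years, because the cutoff year is included
```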

Figure 13 - Query 2

Figure 14 - Query 2 results

Query no 3
The following query (Figure 16) lists the top 10 units from two counties with the highest income over the
year. The output is listed in descending order of the average treatment cost. The query also returns the highest
and lowest cost of an individual treatment. In Figure 16 we can see the treatment table data, where the lowest cost
is 100 and the highest is 500.
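
When ranking by a cost that is also formatted with a £ sign, the sort has to use the numeric value, not the formatted text. The sketch below uses hypothetical unit names and amounts, deliberately including one four-digit amount, to show how the two orderings can differ.

```python
# Hypothetical average treatment costs per unit
averages = {"Unit A": 500.0, "Unit B": 1200.0, "Unit C": 350.0}

# Ranking on the numeric value
by_number = sorted(averages, key=averages.get, reverse=True)

# Ranking on the formatted text, e.g. '£ 1200.00', compares character by character
by_text = sorted(averages, key=lambda unit: f"£ {averages[unit]:.2f}", reverse=True)

print(by_number)  # ['Unit B', 'Unit A', 'Unit C']
print(by_text)    # ['Unit A', 'Unit C', 'Unit B']
```

As long as all amounts have the same number of digits, both orderings agree, which is why the difference may not show up in a small sample.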

Figure 17 - Query 3 output

Query no 4
The following query returns the total number of patients (Figure 18), split by patient type
(Inpatient/Outpatient). The query calculates the number of times each patient was admitted to hospital, the total
number of drugs prescribed for that patient, the total price of drugs for each patient and the grand total.

Figure 18 - Total patients

Figure 19 - Query 4

Query no 5
In my schema I am assuming that the NHS will record the prescription type (e.g. free, exempted or paid). The
5th query (Figure 21) calculates the total amount for each prescription type, broken down by year and month. It
also returns the average price per drug for each year and the grand total for all years.
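
The per-type columns rely on conditional aggregation: each row contributes its cost to the column of its own prescription type and nothing to the others. A minimal Python sketch of that idea follows; the sample rows and the function name are hypothetical, while the type codes 'free', 'ext' and 'paid' are the ones used in the query.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, prescription type, total_cost_drug)
rows = [
    (2018, 1, "free", 50),
    (2018, 1, "paid", 200),
    (2018, 1, "paid", 100),
    (2018, 2, "ext", 75),
]


def pivot_by_type(rows, types=("free", "ext", "paid")):
    """One entry per (year, month) with one summed column per prescription type."""
    table = defaultdict(lambda: dict.fromkeys(types, 0))
    for year, month, ptype, cost in rows:
        # Equivalent of SUM(CASE WHEN type = ptype THEN cost ELSE 0 END)
        table[(year, month)][ptype] += cost
    return dict(table)


result = pivot_by_type(rows)
for (year, month), totals in sorted(result.items()):
    print(year, month, totals)
```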

Figure 21 - Query 5

Query no 6

The following query (Figure 23) calculates the percentage change between two given years. It groups by
drug type (category), sums the total number of issued drugs per type/year and the total costs for the year, and
calculates any increase or decrease in the total number and the total cost. The percentage change is calculated by
applying the following formula: C% = (2ndValue – 1stValue) * 100 ÷ 1stValue
(e.g. 2009's Calcium total number of drugs is 5; 2010's Calcium total number of drugs is 17; applying the above
formula gives (17 - 5) * 100 / 5 = 240, which means a 240% increase. If the number is positive it is an increase;
if it is negative it is a decrease). The correct calculations can be observed in the output screenshot (Figure 24).
Note: This query can be improved and simplified by using the common table expression (CTE) syntax WITH, but
this functionality is available only from MySQL 8.0 onwards (MySQL.com, 2018), and MySQL 5.6 was installed on the
development/testing environment.
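
The percentage calculation can be checked with a small Python function. The function name is hypothetical; returning 0 for a missing or zero first value mirrors the CASE guard used in Query 6.

```python
def percent_change(first_value, second_value):
    """Percentage change from first_value to second_value, rounded to 2 decimals."""
    if first_value is None or second_value is None or first_value == 0:
        return 0.0  # same fallback as the CASE ... THEN 0 branch in the query
    return round((second_value - first_value) * 100 / first_value, 2)


print(percent_change(5, 17))   # 240.0  (Calcium, 2009 to 2010)
print(percent_change(17, 5))   # -70.59 (a decrease)
print(percent_change(0, 17))   # 0.0    (guard against division by zero)
```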

Figure 23 - Query 6

Figure 24 - Query 6 output

Bibliography
Ballard, C., Farrell, D. M., Gupta, A., Mazuela, C., & Vohnik, S. (2006). Dimensional Modeling: In a Business Intelligence
Environment. NY: International Business Machines Corporation IBM Corp.

MySQL.com. (2018). Aggregate (GROUP BY) Function Descriptions. Retrieved Nov 30, 2018, from MySQL Oracle
Corporation: https://dev.mysql.com/doc/refman/8.0/en/group-by-functions.html

MySQL.com. (2018). WITH Syntax (Common Table Expressions). Retrieved Dec 07, 2018, from MySQL Oracle
Corporation: https://dev.mysql.com/doc/refman/8.0/en/with.html

Nguyen, T., Tjoa, A., & Trujillo, J. (2005). Data Warehousing and Knowledge Discovery: A Chronological View of
Research Challenges. DaWaK 2005, LNCS, 530-535.

Rad, R. (2014). Microsoft SQL Server 2014: Business Intelligence Development. Birmingham: Packt Publishing.

Appendices

Query 1

SELECT
dim_location.unit_county AS 'County',
COALESCE(dim_location.unit_city, 'TOTAL') AS City,
COALESCE(dim_date.year, 0) AS 'Year',
COUNT(fact_total_treatment_cost.total_no_of_treatments) AS 'No of Treatments',
CONCAT('£ ', SUM(fact_total_treatment_cost.total_cost_of_treatments)) AS 'Total Cost'
FROM fact_total_treatment_cost
INNER JOIN dim_date
ON fact_total_treatment_cost.date_key = dim_date.date_key
INNER JOIN dim_location
ON fact_total_treatment_cost.unit_key = dim_location.unit_key
WHERE dim_location.unit_county = 'hampshire'
GROUP BY dim_date.year DESC,
dim_location.unit_city WITH ROLLUP;

Query 2

SELECT
dim_date.year AS 'Year',
COALESCE(dim_location.unit_city, '') AS 'City',
dim_admission.admission_type AS 'Patient Type',
dim_patient.patient_gender AS 'Gender',
COUNT(fact_total_treatment_cost.total_no_of_treatments) AS 'No of Treatments',
CONCAT('£ ', SUM(fact_total_treatment_cost.total_cost_of_treatments)) AS 'Total Cost',
CONCAT('£ ', ROUND(AVG(fact_total_treatment_cost.total_cost_of_treatments), 2)) AS 'Average Cost'
FROM fact_total_treatment_cost
INNER JOIN dim_date
ON fact_total_treatment_cost.date_key = dim_date.date_key
INNER JOIN dim_location
ON fact_total_treatment_cost.unit_key = dim_location.unit_key
INNER JOIN dim_patient
ON fact_total_treatment_cost.patient_key = dim_patient.patient_key
INNER JOIN dim_admission
ON fact_total_treatment_cost.admission_key = dim_admission.admission_key
WHERE dim_date.year >= YEAR(DATE_SUB(CURDATE(), INTERVAL 5 year))
AND dim_location.unit_city = 'portsmouth'
AND dim_patient.patient_gender = 'm'
AND dim_admission.admission_type = 'outpatient'
GROUP BY dim_date.year DESC
WITH ROLLUP;

Query 3

SELECT
dim_date.year AS 'Year',
CONCAT(dim_location.unit_name, ', ', dim_location.unit_city, ', ', dim_location.unit_county) AS 'Location',
CONCAT('£ ', SUM(dim_treatment.treatment_cost)) AS 'Total Cost',
COUNT(fact_total_treatment_cost.total_no_of_treatments) AS 'Treatments',
CONCAT('£ ', ROUND(AVG(dim_treatment.treatment_cost), 2)) AS 'Average Treatment',
CONCAT('£ ', MAX(dim_treatment.treatment_cost)) AS 'Highest Treatment',
CONCAT('£ ', MIN(dim_treatment.treatment_cost)) AS 'Lowest Treatment'
FROM fact_total_treatment_cost
INNER JOIN dim_treatment
ON fact_total_treatment_cost.treatment_key = dim_treatment.treatment_key
INNER JOIN dim_location
ON fact_total_treatment_cost.unit_key = dim_location.unit_key
INNER JOIN dim_date
ON fact_total_treatment_cost.date_key = dim_date.date_key
WHERE dim_location.unit_county = 'hampshire'
OR dim_location.unit_county = 'dorset'
GROUP BY dim_date.year,
dim_location.unit_city,
fact_total_treatment_cost.total_no_of_treatments
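-- Order by the numeric average; the formatted '£ ...' alias would be sorted as text
ORDER BY AVG(dim_treatment.treatment_cost) DESC
LIMIT 10;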

Query 4

SELECT
COALESCE(dim_patient.patient_key, 'Grand Total') AS 'Patient ID',
SUM(CASE WHEN patient_type = 'Inpatient' THEN 1 ELSE 0 END) AS 'Inpatient',
SUM(CASE WHEN patient_type = 'Outpatient' THEN 1 ELSE 0 END) AS 'Outpatient',
SUM(fact_drugs.total_no_drug) AS 'Total No of Drugs',
CONCAT('£ ', SUM(fact_drugs.total_cost_drug)) AS 'Total Drugs Cost'
FROM fact_drugs
INNER JOIN dim_patient
ON fact_drugs.patient_key = dim_patient.patient_key
INNER JOIN dim_prescription
ON fact_drugs.prescription_key = dim_prescription.prescription_key
GROUP BY dim_patient.patient_key WITH ROLLUP

Query 5

SELECT
COALESCE(dim_date.year, 'Grand Total') AS 'Year',
COALESCE(dim_date.month, 'N/A') AS 'Month',
CONCAT('£ ', SUM(CASE WHEN dim_prescription.prescription_pmt = 'free'
THEN fact_drugs.total_cost_drug ELSE 0 END)) AS 'Free',
CONCAT('£ ', ROUND(AVG(CASE WHEN dim_prescription.prescription_pmt = 'free'
THEN fact_drugs.total_cost_drug ELSE 0 END), 2)) AS 'AVG Free',
CONCAT('£ ', SUM(CASE WHEN dim_prescription.prescription_pmt = 'ext'
THEN fact_drugs.total_cost_drug ELSE 0 END)) AS 'Exempted',
CONCAT('£ ', ROUND(AVG(CASE WHEN dim_prescription.prescription_pmt = 'ext'
THEN fact_drugs.total_cost_drug ELSE 0 END), 2)) AS 'AVG Ext',
CONCAT('£ ', SUM(CASE WHEN dim_prescription.prescription_pmt = 'paid'
THEN fact_drugs.total_cost_drug ELSE 0 END)) AS 'Paid',
CONCAT('£ ', ROUND(AVG(CASE WHEN dim_prescription.prescription_pmt = 'paid'
THEN fact_drugs.total_cost_drug ELSE 0 END), 2)) AS 'AVG Paid'
FROM fact_drugs
INNER JOIN dim_prescription
ON fact_drugs.prescription_key = dim_prescription.prescription_key
INNER JOIN dim_date
ON fact_drugs.date_key = dim_date.date_key
GROUP BY dim_date.year,
dim_date.month WITH ROLLUP;

Query 6

SELECT
A.*,
CONCAT(ROUND((CASE WHEN (A.total_no_drug IS NULL OR
B.total_no_drug IS NULL OR
B.total_no_drug = 0) THEN 0 ELSE (A.total_no_drug - B.total_no_drug) * 100 /
B.total_no_drug END), 2), ' %') AS 'Drug No Diff',
CONCAT(ROUND((CASE WHEN (A.total_cost_drug IS NULL OR
B.total_cost_drug IS NULL OR
B.total_cost_drug = 0) THEN 0 ELSE (A.total_cost_drug - B.total_cost_drug) * 100 /
B.total_cost_drug END), 2), ' %') AS 'Total Cost Diff'
FROM (SELECT
SUBSTR(year, 1, 4) year, drug_type,
SUM(total_no_drug) total_no_drug,
SUM(total_cost_drug) total_cost_drug
FROM fact_drugs fd
INNER JOIN dim_date dd
ON fd.date_key = dd.date_key
INNER JOIN dim_drug drug
ON fd.drug_key = drug.drug_key
WHERE dd.year BETWEEN 2009 AND 2010
GROUP BY dd.year, drug.drug_type) A
LEFT JOIN (SELECT
SUBSTR(year, 1, 4) year, dim_drug.drug_type,
SUM(total_no_drug) total_no_drug,
SUM(total_cost_drug) total_cost_drug
FROM fact_drugs fd
INNER JOIN dim_date
ON fd.date_key = dim_date.date_key
INNER JOIN dim_drug
ON fd.drug_key = dim_drug.drug_key
WHERE dim_date.year BETWEEN 2009 AND 2010
GROUP BY dim_date.year,
dim_drug.drug_type) B
ON A.YEAR = (B.YEAR + 1)
AND A.drug_type = B.drug_type
