
B2 ETL Null Values Treatment

Tony Wang tony.wang@um.es


Business intelligence in biomedicine
Course 2014

Abstract. A short review of null value treatment in the ETL (Extract, Transform, Load) process. It explains how missing data affect the efficiency of a database and why null values need to be removed or treated when applying ETL. For this reason, a study of techniques for treating missing data (null values) is presented.

The current database problem: null values

The decision whether to allow null values when building a database depends directly on the design of the database and its intended use, so we have to balance business needs against performance. For instance, if a table has a begin date and an end date, we often do not know the end date [1]. Therefore, although null values in the database adversely affect performance, forcing such columns to be not null can sometimes be worse. Another example illustrates this better.
Imagine a client data form where null values are not allowed. In the field "another phone number", people who have no additional phone number would type anything just to finish the form. We would then have inconsistent data such as 123456789, and cleaning this data before loading it into our data warehouse could be more costly than allowing null values.
Before discussing whether or not to allow null values, it is assumed that any ordinary database containing null values to which we want to apply the ETL process already satisfies the three relational database integrity constraints or rules [2,3]. These are the following (a minimal sketch illustrating them follows the list):
- Entity Integrity: every table must have a primary key, and the column or columns chosen as the primary key must be unique and not null.
- Referential Integrity: any foreign key value can only be in one of two states: referring to an existing primary key value in another table of the database or, occasionally and depending on the business rules, null.
- Domain Integrity: all the columns in a relational database must be declared over a defined domain. The items of the database have to be atomic and non-decomposable, and they have to belong to the declared domain.
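As a minimal sketch (not taken from the article), the following Python snippet uses the standard sqlite3 module to declare the three constraints; the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Entity integrity: a unique, non-null primary key.
conn.execute("""
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,   -- unique and NOT NULL by definition
        name       TEXT NOT NULL,
        birth_date TEXT                   -- nullable: may be unknown
    )""")

# Referential integrity: the foreign key must reference an existing patient,
# or (depending on business rules) be NULL.
# Domain integrity: the CHECK clause restricts 'fever' to a defined domain.
conn.execute("""
    CREATE TABLE visit (
        visit_id   INTEGER PRIMARY KEY,
        patient_id INTEGER REFERENCES patient(patient_id),
        fever      INTEGER CHECK (fever IN (0, 1) OR fever IS NULL)
    )""")

conn.execute("INSERT INTO patient VALUES (1, 'Alice', NULL)")
conn.execute("INSERT INTO visit VALUES (10, 1, NULL)")   # unknown fever is allowed
# conn.execute("INSERT INTO visit VALUES (11, 99, 1)")   # would fail: no patient 99
```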
With that said, we can start comparing having and not having null values. First of all, null values can be a problem when integrating the database with an application written in a particular language: many programming languages do not natively handle a null result set, which can cause errors. Sometimes a value in the database is calculated from another value; if this base value is null, then the calculated value is also null, which undermines the integrity of the database. On the other hand, if we are calculating, for instance, the age of a person from the date of birth, we have to allow null values in the date of birth. In this case, the null indicates that the data is unknown; if we stored a predefined value as the date of birth, the calculated age would simply be wrong. A small sketch of this case follows.
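As a hedged illustration (not from the article), a derived value such as age should propagate the unknown rather than compute from a placeholder; the function below is a hypothetical sketch.

```python
from datetime import date
from typing import Optional

def age_in_years(birth_date: Optional[date], today: date) -> Optional[int]:
    """Return the age in whole years, or None if the birth date is unknown."""
    if birth_date is None:
        return None  # propagate the unknown instead of guessing
    years = today.year - birth_date.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

print(age_in_years(date(1980, 5, 20), date(2014, 11, 26)))  # 34
print(age_in_years(None, date(2014, 11, 26)))               # None, not a bogus age
```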
We can therefore assume that null means that some value is unknown, missing or irrelevant. This is useful because in most programming languages a boolean can only be true or false, whereas null values give us a new feature called three-valued logic. However, it has drawbacks too. Because null values carry meaning in databases, using them also carries a cost: in the relational database model, null values are held in memory as if they were ordinary values, so they take time to be accessed, compared and used, and they need storage space, which directly affects performance (e.g. in large queries) [4,5,6].
Finally, the main and most important reason to avoid null values is the following: an improperly handled null value can break any ETL process. Null values pose the biggest risk when they appear in foreign key columns. Joining two or more tables on a column that contains null values will cause data loss because, in a relational database, null is not equal to null. This makes joins fail, so it is important to check for null values in every foreign key of the source database [7]. The sketch below illustrates this behaviour.
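The following is a minimal sqlite3 sketch (tables and values are invented for illustration) showing that NULL is not equal to NULL and that an inner join silently drops rows with a null foreign key, while an outer join keeps them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three-valued logic: NULL = NULL evaluates to NULL (unknown), not true.
print(conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone())  # (None, 1)

conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee   (emp_id  INTEGER PRIMARY KEY, dept_id INTEGER);
    INSERT INTO department VALUES (1, 'Cardiology');
    INSERT INTO employee   VALUES (100, 1), (101, NULL);  -- 101 has a null FK
""")

# Inner join: the employee with a NULL dept_id disappears from the result.
print(conn.execute("""
    SELECT e.emp_id, d.name FROM employee e
    JOIN department d ON e.dept_id = d.dept_id
""").fetchall())  # [(100, 'Cardiology')]

# Left outer join: the row is kept, with NULL for the missing department.
print(conn.execute("""
    SELECT e.emp_id, d.name FROM employee e
    LEFT JOIN department d ON e.dept_id = d.dept_id
""").fetchall())  # [(100, 'Cardiology'), (101, None)]
```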
1.1 Missing data in clinical databases

The problem of missing clinical data is not only present in computerised databases; it also occurs when the data are on paper, since a clinical database ultimately serves the same purpose but keeps the information organised for the physician and makes it easier to use. Clinical databases are also used as input data for clinical decision support systems, and this is where the problem of missing clinical data matters most.
In both cases (data on paper, or a clinical database feeding a support system), values may be missing when diagnosing a patient for many reasons: death, adverse reactions, unpleasant study procedures, lack of improvement, early recovery, and other factors related or unrelated to the trial procedures and treatments [8]. In addition, when we are developing an information system, missing values in the database (in this case null values) can be a challenge for the developer. For example, if we have a boolean field called fever in the database and the value we retrieve is missing (null), most programming languages will initialise it to false, which is not correct and may lead to a wrong diagnosis. A sketch of this pitfall follows.
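A hedged Python sketch of the point above, with an invented record layout: mapping an absent value straight to a boolean loses the distinction between "no fever" and "fever unknown".

```python
from typing import Optional

record = {"patient_id": 7, "fever": None}  # fever was never recorded

# Risky: bool(None) is False, so "unknown" silently becomes "no fever".
fever_naive = bool(record["fever"])
print(fever_naive)  # False -- may lead to a wrong diagnosis

# Safer: keep the three states explicit (True, False, unknown).
fever: Optional[bool] = record["fever"]
if fever is None:
    print("Fever status unknown: ask, measure, or impute before deciding.")
elif fever:
    print("Patient has fever.")
else:
    print("Patient has no fever.")
```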

Finally, before focusing on the main topic of this article, it is worth knowing that several investigators agree that there are essentially three classes of missing-data mechanisms. These are listed formally as follows [9,10,11]; a small simulation sketch illustrating them comes after the list.
- Missing completely at random (MCAR): the reasons for the missing data are not associated with the observed or missing values (missingness does not depend on any data, observed or missing). Examples include lost data, accidental omission of an answer on a questionnaire, accidental breaking of a laboratory instrument, or personnel error (e.g. a dropped test tube in a lab or an equipment failure).
- Missing at random (MAR): given the observed data, the failure to observe a value does not depend on the data that are unobserved (e.g. once you know someone's age, their chance of having blood pressure recorded is independent of their blood pressure level).
- Missing not at random (MNAR): the failure to observe a value depends on the value that would have been observed or on other missing values in the data set (e.g. patients with high blood pressure are more likely to have their blood pressure measured).
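The simulation below is a small sketch (not from the article; variable names are invented) that generates the three mechanisms on a synthetic blood-pressure column so the definitions can be seen operationally.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(20, 90, n),
    "bp":  rng.normal(120, 15, n),   # systolic blood pressure
})

# MCAR: every value has the same 10% chance of being missing.
mcar = df["bp"].mask(rng.random(n) < 0.10)

# MAR: missingness depends only on an observed variable (age),
# not on the blood pressure value itself.
mar = df["bp"].mask(rng.random(n) < np.where(df["age"] > 60, 0.05, 0.30))

# MNAR: missingness depends on the unobserved value itself
# (here, low readings are more likely to be left unrecorded).
mnar = df["bp"].mask(rng.random(n) < np.where(df["bp"] < 120, 0.30, 0.05))

for name, col in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "missing:", col.isna().sum(), "mean of observed:", round(col.mean(), 1))
```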

Strategies to solve the problem and improvements obtained

The first and most important thing is to check that no primary key is null in any table (entity integrity rule). We also have to check that every foreign key refers to an existing primary key (referential integrity rule). A good way to avoid losing data when joining on null foreign keys is to use outer joins. Secondly, it is good practice to use default values chosen by the database designer. These defaults are used when the information is not available, so instead of a null we can store a string such as "unknown", a default date, a negative integer, etc. Some users also recommend empty strings when the value is unavailable or unknown, but this can be confusing in the business intelligence process [12,13]. A small sketch of this default-value substitution follows.
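A minimal pandas sketch of the default-value strategy during the transform step; the column names and sentinel values are assumptions for illustration, not prescriptions.

```python
import pandas as pd

# A small extract with several kinds of missing values.
extract = pd.DataFrame({
    "client_id":   [1, 2, 3],
    "other_phone": [None, "600111222", None],
    "end_date":    [pd.NaT, pd.Timestamp("2014-06-30"), pd.NaT],
    "age":         [34, None, 51],
})

# Transform: replace nulls with designer-chosen defaults before loading.
defaults = {
    "other_phone": "unknown",                   # sentinel string
    "end_date":    pd.Timestamp("9999-12-31"),  # sentinel "open-ended" date
    "age":         -1,                          # sentinel negative integer
}
clean = extract.fillna(value=defaults)
print(clean)
```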
Finally, there exist many techniques and algorithms to estimate missing data from other data. Some of the most widely used are described below; a combined sketch follows the list. These techniques are mainly applied when the missing data are numeric.
- Listwise Deletion: this method omits the cases (instances) with missing data and performs the analysis on the remaining ones. Although it is the most common method, it has two obvious disadvantages: a) a substantial decrease in the size of the dataset available for the analysis; b) data are not always missing completely at random [14].
- Mean/Mode Imputation (MMI): replaces missing data with the mean (numeric attribute) or mode (nominal attribute) of all observed cases. To reduce the influence of exceptional values, the median can also be used. This is one of the most commonly used methods [14].
- K-Nearest Neighbour Imputation (KNN): this method uses a k-nearest neighbour algorithm to estimate and replace missing data. Its main advantages are that: a) it can estimate both qualitative and quantitative attributes; b) it is not necessary to build a predictive model for each attribute with missing data [14].
- Classification algorithm: usually a decision tree is used. A decision tree is a tree in which each branch node represents a choice between a number of alternatives and each leaf node represents a decision. We build the tree and then feed the available data into it to predict the missing attribute [14].
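The sketch below (assuming a pandas DataFrame with a numeric gap; names are invented) applies the four techniques with pandas and scikit-learn; the decision-tree variant here predicts the missing attribute from the complete rows.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "age":      [30, 41, 57, 23, 64],
    "systolic": [118, None, 142, 110, None],  # attribute with missing values
})

# 1) Listwise deletion: drop rows with any missing value.
listwise = df.dropna()

# 2) Mean imputation (use .median() or .mode() for the other variants).
mean_imputed = df.fillna({"systolic": df["systolic"].mean()})

# 3) KNN imputation over all numeric columns.
knn_imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                           columns=df.columns)

# 4) Tree-based prediction: train on complete rows, predict the missing ones.
known = df[df["systolic"].notna()]
missing = df[df["systolic"].isna()]
tree = DecisionTreeRegressor().fit(known[["age"]], known["systolic"])
tree_imputed = df.copy()
tree_imputed.loc[missing.index, "systolic"] = tree.predict(missing[["age"]])

print(listwise, mean_imputed, knn_imputed, tree_imputed, sep="\n\n")
```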
Commonly used simple approaches include replacing missing values with values imputed from the observed data (for example, the mean of the observed values), using a missing-category indicator, and replacing missing values with the last measured value (last value carried forward). None of these approaches is statistically valid in general, and they can lead to serious bias. Single imputation of missing values usually makes the standard errors too small, since it fails to account for the fact that we are uncertain about the missing values [15]. For these reasons we need more complex and effective techniques such as Inverse Probability Weighting, Multiple Imputation (MI), Likelihood-based Analysis, ITT [11], a Bayesian Estimator [16], Database Partitioning and Merging [17] or even a Fuzzy Estimator [18]. A sketch of an iterative, model-based imputation follows.
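As a hedged example of a more sophisticated approach (not the specific methods cited above), scikit-learn's IterativeImputer models each feature with missing values as a function of the other features; run several times with sample_posterior=True it can serve as a building block for multiple imputation. The data and column names below are invented.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age":       [30, 41, 57, 23, 64, 49],
    "systolic":  [118, np.nan, 142, 110, np.nan, 135],
    "weight_kg": [70, 82, np.nan, 61, 90, 78],
})

# Draw several completed data sets, as in multiple imputation: each run
# samples from the conditional distribution of the missing values.
completed = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(df),
        columns=df.columns,
    )
    for seed in range(5)
]

# Analyses are then run on each completed data set and the results combined,
# which keeps the uncertainty about the imputed values visible.
print(pd.concat(completed).groupby(level=0)["systolic"].agg(["mean", "std"]))
```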
To wrap up, there are six principles for drawing inferences from data with missing values, and it is good to know them so that we can choose the right technique to apply to the null values in our database [19].
1. Determine if possible whether the values that are missing are meaningful for
analysis and hence meet the definition of missing data.
2. Formulate an appropriate and well-defined causal primary measure of treatment effect in terms of the data that were intended to be collected. It is
important to distinguish what is being estimated from the method of estimation, which may vary according to assumptions.
3. Document the reasons why data are missing. Knowing the reasons for missing
data can help formulate sensible assumptions about observations that are
missing.
4. Decide on a primary set of assumptions about the missing-data mechanism.
5. Conduct a statistically valid analysis under the primary missing-data assumptions.
6. Assess the robustness of inferences about treatment effects to various missing-data assumptions, analysing, for example, the ratio of success, the deviation of the inferred data, etc.
2.1 How do Weka and Oracle Data Mining treat the problem

In Weka there is a filter called ReplaceMissingWithUserConstant that replaces all missing values for nominal, string, numeric and date attributes in the dataset with user-supplied constant values [20]. There are also additional Weka packages; in particular, one called EMImputation replaces missing numeric values using Expectation Maximization with a multivariate normal model [21]. We can also apply any of the techniques mentioned in Section 2; in particular, Weka provides the classifiers J48 and OneR, which can be used to predict missing data values [14,22].
Oracle Data Mining (ODM) distinguishes between sparse data and data that contains random missing values. The latter means that some attribute values are unknown, whereas sparse data contains values that are assumed to be known even though they are not represented in the data [23]. In ODM, certain algorithms assume that a null value indicates a missing value, and others assume that a null value indicates sparse data. For Support Vector Machine, k-Means, association and Non-Negative Matrix Factorization, null values indicate sparse data; for all other algorithms, null values indicate randomly missing values, about which nothing more can be done because they are genuinely unknown. If an algorithm assumes that null values indicate sparse data, we should first treat any nulls that are really missing values. In any case, ODM is robust in handling missing values and does not require users to treat them in any special way: it applies an appropriate treatment to fill the missing data when necessary, and otherwise ignores missing values while still using the non-missing data in a case [24].
In previous releases, data transformation was the responsibility of the user. In Oracle Database 11g, the data preparation process can be automated: algorithm-appropriate transformation instructions are embedded in the model and automatically applied to the build data and the scoring data. Handling of sparse data and missing values has been standardized across algorithms in Oracle Data Mining 11g. In this version we can use the DBMS_DATA_MINING_TRANSFORM package, which includes a variety of missing value and outlier treatments. We can also configure a GLM model to override the default treatment of missing values through the ODMS_MISSING_VALUE_TREATMENT setting, which makes the algorithm delete rows of the training data that have missing values instead of replacing them with the mean or the mode. However, when the model is applied, ODM still performs the usual mean/mode missing value replacement, so if we want to delete rows with missing values when scoring the model, we must perform the transformation explicitly [25].

Conclusion

Many database experts say that it is better not to have null values in any case, but sometimes they are necessary because of business rules. I think we can make good use of null values as long as we treat them correctly afterwards, when developing the application that uses the data. We also have to treat the null values in the ETL process (for example, with some of the techniques described above) in order to have a consistent database and to avoid losing data in the data warehouse.

I think that in databases storing clinical values, where having the right data is essential to improve people's health, it is more than necessary to avoid every null value that could compromise a patient's life (or at least to define a good null value treatment strategy thoroughly). In any case, building a solid data warehouse is not easy when there is a lot of missing data. Applying good techniques to infer all this missing data is hard work, and even then it may not recover the right data 100% of the time. Normally, in the ETL process we use the simplest techniques, because the databases we are given are not that complicated and do not have many calculated null values (mostly just MCAR values).

References
1. J. Ojvind Nielsen, How do null values affect performance in a database search?. http://stackoverflow.com/questions/1017239/how-do-null-values-affect-performance-in-a-database-search/, 2009. Accessed November 26, 2014.
2. P. Singh, Database Management System Concepts, ch. 3, pp. 58-59. Vk Publications, 2010.
3. M. del Rocio Boone Rojas, M. Carrillo Ruiz, B. Bernabe Loranca, and M. Soriano Ulloa, Treatment of integrity restrictions in relational dbms with triggers, in
Electronics, Communications and Computers, 2006. CONIELECOMP 2006. 16th
International Conference on, pp. 45-45, Feb 2006.
4. C. Rubinson, Nulls, three-valued logic, and ambiguity in SQL: critiquing Date's critique, SIGMOD Rec., vol. 36, pp. 13-17, Dec. 2007.
5. J. Brady, Avoiding nulls. http://databaseperformance.blogspot.com.es/
2012/06/avoiding-nulls.html/, 2012. Accessed November 26, 2014.
6. M. A. Poolet, Designing for performance: Null or not null?. http://sqlmag.com/
sql-server-2000/designing-performance-null-or-not-null/, 2006. Accessed
November 26, 2014.
7. Null values. http://www.dataself.com/wiki/NULL_Values/, 2014. Accessed
November 26, 2014.
8. W. J. Shih, Problems in dealing with missing data and informative censoring in
clinical trials, Trials, vol. 3, no. 1, p. 4, 2002.
9. C. A. Welch, I. Petersen, J. W. Bartlett, I. R. White, L. Marston, R. W. Morris,
I. Nazareth, K. Walters, and J. Carpenter, Evaluation of two-fold fully conditional
specification multiple imputation for longitudinal electronic health record data,
Statistics in medicine, 2014.
10. J. G. Ibrahim, H. Chu, and M.-H. Chen, Missing data in clinical studies: issues and methods, Journal of Clinical Oncology, vol. 30, no. 26, pp. 3297-3303, 2012.
11. J. D. Dziura, L. A. Post, Q. Zhao, Z. Fu, and P. Peduzzi, Strategies for dealing
with missing data in clinical trials: From design to analysis, The Yale journal of
biology and medicine, vol. 86, no. 3, p. 343, 2013.
12. ETL tip - how to handle null vs the blank fields in data warehouse. https://analyticsreckoner.wordpress.com/2012/07/25/etl-tip-how-to-handle-null-vs-the-blank-fields-in-data-warehouse/, 2012. Accessed December 01, 2014.
13. Data integration info: ETL (extract-transform-load). http://www.dataintegration.info/etl/, 2011. Accessed December 01, 2014.

14. Minakshi, D. R. Vohra, and Gimpy, Missing value imputation in multi attribute
data set, International Journal of Computer Science and Information Technologies, vol. 5, 2014.
15. J. A. Sterne, I. R. White, J. B. Carlin, M. Spratt, P. Royston, M. G. Kenward, A. M. Wood, and J. R. Carpenter, Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls, BMJ, vol. 338, 2009.
16. S. Oba, M.-a. Sato, I. Takemasa, M. Monden, K.-i. Matsubara, and S. Ishii, A Bayesian missing value estimation method for gene expression profile data, Bioinformatics, vol. 19, no. 16, pp. 2088-2096, 2003.
17. T. Shintani, Mining association rules from data with missing values by database
partitioning and merging, in Computer and Information Science, 2006 and 2006
1st IEEE/ACIS International Workshop on Component-Based Software Engineering, Software Architecture and Reuse. ICIS-COMSAR 2006. 5th IEEE/ACIS International Conference on, pp. 193-200, IEEE, 2006.
18. S.-J. Lee and X. Zeng, A modular method for estimating null values in relational
database systems, in Intelligent Systems Design and Applications, 2008. ISDA'08. Eighth International Conference on, vol. 2, pp. 415-419, IEEE, 2008.
19. R. J. Little, R. D'Agostino, M. L. Cohen, K. Dickersin, S. S. Emerson, J. T. Farrar, C. Frangakis, J. W. Hogan, G. Molenberghs, S. A. Murphy, et al., The prevention and treatment of missing data in clinical trials, New England Journal of Medicine, vol. 367, no. 14, pp. 1355-1360, 2012.
20. Data mining algorithms and tools in weka. http://wiki.pentaho.com/display/
DATAMINING/ReplaceMissingWithUserConstant/, 2014. Accessed December 01,
2014.
21. Weka packages. http://weka.sourceforge.net/packageMetaData/, 2014. Accessed December 01, 2014.
22. I. H. Witten, Data mining with weka. http://www.cs.waikato.ac.nz/ml/weka/
mooc/dataminingwithweka/slides/Class5-DataMiningWithWeka-2013.pdf/.
Accessed December 01, 2014.
23. Oracle data mining users guide: Preparing the data. https://docs.oracle.
com/database/121/DMPRG/xform_casetbl.htm#DMPRG005/. Accessed December
01, 2014.
24. Data mining concepts: Data for oracle data mining. https://docs.oracle.com/
cd/B19306_01/datamine.102/b14339/2data.htm/. Accessed December 01, 2014.
25. Oracle data mining concepts 11g release 2 (11.2). http://people.inf.elte.hu/
kiss/14dwhdm/e16808.pdf/, 2010. Accessed December 01, 2014.
