Data Normalization:
Introduction:
Data normalization is a standardized way of keeping a data structure clean and efficient by
eliminating duplicated data and the errors that duplication causes in data operations. “It is a
process in which data attributes within a data model are organized to increase the cohesion of
entity types”. The aim of normalizing a set of data is to eliminate data redundancy, because
attributes that are stored in several tables of a relational database are difficult to keep
consistent.
Data normalization plays a vital role in successful database design. Without normalization,
database operations can generate errors and the database system can become inefficient and
inaccurate.
Normalization techniques:
Normalization is the process of organizing the data in a database efficiently. It ensures that no
data is stored redundantly and that every attribute is stored with the data it depends on. This
helps to reduce storage space and improve performance.
Normalization techniques are a set of rules, each of which is called a “Normal Form” (NF).
The forms range from the first normal form (1NF) to the fifth normal form (5NF), in increasing
level of normalization. There is also one higher level, called domain-key normal form
(DK/NF). Only the three basic forms are described here.
First normal form (1NF): An entity type is said to be in first normal form if its table does not
contain any repeating columns. The first normal form can be achieved by
Second normal form (2NF): An entity type is said to be in second normal form when it satisfies
1NF and all of its non-key attributes depend on the whole primary key of the table. The second
normal form can be achieved by
iii. Breaking the table and placing the related entity on the separate table with
unique identifier.
Third normal form (3NF): An entity type is in 3NF when it is in 2NF and all of its non-key
attributes depend directly on the primary key, not on other non-key attributes.
i. The third normal form can be achieved by further splitting the tables that are in second normal form.
Developing against non-normalized data raises several issues. The data set therefore needs to be
normalized before it is processed, so that its functional dependencies are explicit and non-key
data redundancy is reduced.
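As a minimal sketch of the splitting step, assume a hypothetical flat table of offences in which every row repeats the name and region of the police force. The column names (`offence_id`, `force_code`, `force_name`, `region`) and the sample values are illustrative assumptions, not taken from the original dataset. Because `force_name` and `region` depend on `force_code` rather than directly on the key `offence_id`, moving them to their own table removes the transitive dependency that 3NF forbids.

```python
# Hypothetical flat rows: force_name and region are repeated on every
# offence row, although they depend only on force_code.
flat_rows = [
    {"offence_id": 1, "offence": "Burglary", "force_code": "F01",
     "force_name": "Force A", "region": "North"},
    {"offence_id": 2, "offence": "Theft", "force_code": "F01",
     "force_name": "Force A", "region": "North"},
    {"offence_id": 3, "offence": "Burglary", "force_code": "F02",
     "force_name": "Force B", "region": "South"},
]


def split_tables(rows):
    """Move the attributes that depend on force_code into their own table."""
    forces = {}    # force_code -> force attributes, stored once per force
    offences = []  # offence rows keep only a reference to the force
    for row in rows:
        forces[row["force_code"]] = {
            "force_name": row["force_name"],
            "region": row["region"],
        }
        offences.append({
            "offence_id": row["offence_id"],
            "offence": row["offence"],
            "force_code": row["force_code"],
        })
    return offences, forces


offences, forces = split_tables(flat_rows)
print(len(offences), "offence rows,", len(forces), "force rows")
```

After the split, each force's name and region are stored once, and the offence rows refer to them through `force_code`.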
Advantage / Goals:
An efficient and functional database is key to successful development. Normalization achieves
this by storing each piece of data in the one place in the database where it logically belongs.
Normalization has four main objectives:
i. Arranging data into logical groups such that each group describes a small entity
of the whole.
ii. Minimizing the amount of duplicated data stored in a database.
iii. Building a database in which we can access and manipulate the data quickly and
efficiently without compromising the integrity of the data storage.
iv. Organising the data such that, when you modify it, you make the changes in
only one place.
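The fourth objective can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module with an in-memory database purely for illustration; the table and column names (`police_force`, `offence`, `force_code`) are hypothetical and the research itself used MSSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE police_force (
        force_code TEXT PRIMARY KEY,
        force_name TEXT NOT NULL
    );
    CREATE TABLE offence (
        offence_id INTEGER PRIMARY KEY,
        offence    TEXT NOT NULL,
        force_code TEXT NOT NULL REFERENCES police_force(force_code)
    );
    INSERT INTO police_force VALUES ('F01', 'Force A');
    INSERT INTO offence VALUES (1, 'Burglary', 'F01'), (2, 'Theft', 'F01');
""")

# Renaming the force changes exactly one row, however many offences refer to it.
cur = conn.execute(
    "UPDATE police_force SET force_name = ? WHERE force_code = ?",
    ("Force A (renamed)", "F01"),
)
rows_changed = cur.rowcount

# Every offence sees the new name through the join.
names = [r[0] for r in conn.execute(
    "SELECT f.force_name FROM offence o "
    "JOIN police_force f ON f.force_code = o.force_code "
    "ORDER BY o.offence_id"
)]
print(rows_changed, names)
```

Had the force name been stored on every offence row instead, the same rename would have had to update each of those rows, with the risk of missing some.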
For this research, the original dataset was taken from the Home Office site
http://rds.homeoffice.gov.uk/rds/soti.html. It contains crime details for the UK from 2003 to
2010, with a total of 79272 offences, and is in comma-separated values (.CSV) format.
As required by my research, I used MSSQL to store and manipulate the dataset, so the .CSV
file first had to be imported into MSSQL Server. The second phase was to obtain a
standardized dataset through data normalization. The steps are as follows.
Hence the final normalized data was obtained for the research.
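The import and normalization phases described above might look roughly like the following sketch. It is not the procedure used in the research: it uses Python's built-in `csv` and `sqlite3` modules as a stand-in for MSSQL, and the sample data, table names and column names (`year`, `force_name`, `offence`, `total`) are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for the downloaded .CSV file.
SAMPLE_CSV = """year,force_name,offence,total
2003,Force A,Burglary,120
2003,Force B,Burglary,95
2004,Force A,Theft,210
"""


def load_csv(conn, csv_file):
    """Load CSV rows into a lookup table and a table that references it."""
    conn.executescript("""
        CREATE TABLE police_force (
            force_id   INTEGER PRIMARY KEY,
            force_name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE offence_total (
            year     INTEGER NOT NULL,
            force_id INTEGER NOT NULL REFERENCES police_force(force_id),
            offence  TEXT NOT NULL,
            total    INTEGER NOT NULL
        );
    """)
    for row in csv.DictReader(csv_file):
        # Store each force name once; a repeated name is ignored.
        conn.execute(
            "INSERT OR IGNORE INTO police_force (force_name) VALUES (?)",
            (row["force_name"],),
        )
        force_id = conn.execute(
            "SELECT force_id FROM police_force WHERE force_name = ?",
            (row["force_name"],),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO offence_total VALUES (?, ?, ?, ?)",
            (int(row["year"]), force_id, row["offence"], int(row["total"])),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_csv(conn, io.StringIO(SAMPLE_CSV))
n_forces = conn.execute("SELECT COUNT(*) FROM police_force").fetchone()[0]
n_rows = conn.execute("SELECT COUNT(*) FROM offence_total").fetchone()[0]
print(n_forces, "forces,", n_rows, "offence rows")
```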
Data Clustering:
Introduction:
1. K-Means Algorithm.
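The algorithm aims to minimize a squared-error objective function
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$
where $\left\| x_i^{(j)} - c_j \right\|^2$ is a distance measured between a data point $x_i^{(j)}$ and the
cluster centre $c_j$, so that $J$ is an indicator of the distance of the n data points from their
respective cluster centres.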
i. First, place k points into the space represented by the objects that are being
clustered. These points represent the initial centroids of the groups.
ii. Using the distance measure, assign each object to the group that has the closest
centroid.
iii. After all the objects have been assigned, recalculate the positions of the k centroids.
iv. Repeat steps ii and iii until the centroids no longer move from their previous
positions. This separates the objects into groups.
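The four steps above can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation used in the research: the function names, the sample points, the choice of k randomly selected objects as initial centroids, the `max_iter` cap and the handling of an empty group are all assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step i: pick k of the objects as the initial centroids (an assumption;
    # the text only says to place k points into the space).
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step ii: assign each object to the group with the closest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step iii: recalculate each centroid as the mean of its group.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Assumption: an empty group keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step iv: stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

The result depends on the initial centroids, so different seeds can give different groupings on data that is less clearly separated than this sample.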