Normalization

The logical design of the database, including the tables and the relationships between them, is the real core of an optimized relational database. A good logical database design lays the foundation for optimal database and application performance. A poor logical database design can impair the performance of the entire system.
Let's say you have a table that holds many thousands of records for your message board posts, for example, and you decide you want to add a rating system so readers can "rate" each post from 1 to 10 stars. Although most of the extra columns of data about ratings aren't directly specific to the process of finding and displaying a post, it's conceivable you could make the case for adding them to the posts table because it is "more convenient". But from a performance standpoint, it's more likely you've shot yourself in the foot. There will likely be many rows for which nobody ever submitted a rating, or which don't qualify to be rated. This poor design becomes especially pronounced the more rows of data there are in the table. What you really want is for that table to hold only the exact required columns of non-repeating data necessary for its primary purpose (displaying message board posts), and to hold that data in the narrowest columns into which it will fit (more on this later). In the normalization process, you would move all the rating information to a separate table, linked back to the main table by the post id. You've achieved the goal of adding your rating feature, but you haven't had to add a single column to the main table, which needs to be optimized for speed. And your new ratings table now holds records only for those posts that actually have a rating, conserving more space. When rating information for rows in the main table is needed, you bring it in with a SQL JOIN. This is what good logical design is all about.
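As a rough sketch of that split (all table and column names here are hypothetical, invented for illustration):

    -- Main table stays narrow: only what's needed to display a post.
    CREATE TABLE Posts (
        PostID   INT IDENTITY(1,1) PRIMARY KEY,
        Title    VARCHAR(200) NOT NULL,
        Body     VARCHAR(MAX) NOT NULL,
        PostedAt DATETIME     NOT NULL DEFAULT GETDATE()
    );

    -- Ratings live in their own table; rows exist only for posts
    -- that have actually been rated.
    CREATE TABLE PostRatings (
        RatingID INT IDENTITY(1,1) PRIMARY KEY,
        PostID   INT     NOT NULL REFERENCES Posts(PostID),
        Stars    TINYINT NOT NULL CHECK (Stars BETWEEN 1 AND 10)
    );

    -- Bring the rating data in only when it is actually needed.
    SELECT p.PostID, p.Title,
           AVG(CAST(r.Stars AS DECIMAL(4,2))) AS AvgStars
    FROM Posts AS p
    JOIN PostRatings AS r ON r.PostID = p.PostID
    GROUP BY p.PostID, p.Title;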
Normalizing a logical database design involves using formal methods to separate the data into multiple, related tables. A greater number of narrow tables (with fewer columns) is characteristic of a normalized database; fewer, wider tables (with more columns) are characteristic of a non-normalized database. Databases are like real estate - it's location, location, location! If you have a table with a large number of columns that really ought to be in another table, linked to the first with a foreign key, somebody is going to pay the price for all that extra real estate you've taken up. You'll see performance suffer on selects, updates, and inserts, for several reasons explained farther down the page.
It's important to understand that even if there is a one-to-one relationship among all the columns in your table, if the extra columns aren't needed for the primary task, you have forced yourself to pay for more real estate in your select statements, because you've chosen to make the table physically wider (more columns) than it needs to be. This happens even though your select statement may not bring back those extra columns of data at all. So if these extra columns don't directly relate to the primary purpose of the table (in this case, displaying message board posts), then by normalizing them out and reducing the number of columns in the primary table, more rows can fit on a single SQL data page, improving performance.
SQL Server page access on the disk is as objective as gravity: if the stuff you want is better organized on the physical media (the MDF file), it stands to reason that you'll get it back faster. And that's exactly what database optimization and normalization do. We had a forum post recently from someone who was attempting to figure out why certain queries took much longer than others. His database has 72 million rows of 8 columns of data. That's a candidate for a completely different kind of normalization and optimization - federated partitioned views, where the table is actually distributed across two or more physical database servers, yet appears as a contiguous, monolithic view to the caller.
Reasonable normalization usually improves performance -- sometimes very significantly. When useful indexes are available, the Microsoft SQL Server 2005 / 2008 query optimizer is efficient at selecting rapid, efficient joins between tables.
Some of the benefits of normalization include:
Faster sorting and index creation.
A larger number of clustered indexes.
Narrower and more compact indexes.
Fewer indexes per table, which improves the performance of INSERT, UPDATE,
and DELETE statements.
Fewer null values and less opportunity for inconsistency, which
increase database compactness.
As normalization increases, so do the number and complexity of joins required to
retrieve data. Too many complex relational joins between too many tables can
begin to hinder performance. Reasonable normalization typically includes few
regularly executed queries that use joins that involve more than four tables.
Sometimes you "inherit" a logical database design that is already fixed -- and
total redesign is not feasible. Even then, however, it might be possible to
normalize a large table selectively into several smaller tables. If the database is
accessed through stored procedures, this schema change could take place
without affecting applications at all. If not, you might be able to create a view
that hides the schema change from the applications.
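For instance, if a wide legacy table had been split in two, a view along these lines (all names hypothetical) could preserve the old shape for existing queries:

    -- The original wide table was split into Customers and
    -- CustomerAddresses; this view re-joins them under one name
    -- so existing application queries keep working unchanged.
    CREATE VIEW dbo.CustomerDetails
    AS
    SELECT c.CustomerID, c.CustomerName,
           a.Street, a.City, a.State, a.Zip
    FROM dbo.Customers AS c
    LEFT JOIN dbo.CustomerAddresses AS a
        ON a.CustomerID = c.CustomerID;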
There are at least eight well-defined normal forms for database normalization: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, 6NF, and DKNF. However, in the normal course of database design work for developers, the first three are the most important:

1NF - Eliminate Repeating Groups - Make a separate table for each set of related
attributes, and give each table a primary key.
2NF - Eliminate Redundant Data - If an attribute depends on only part of a multi-valued key, remove it to a separate table.
3NF - Eliminate Columns Not Dependent on Key - If attributes do not contribute
to a description of the key, remove them to a separate table.
Achieving a Well-Designed Database
In relational-database design theory, normalization rules identify certain
attributes that must be present or absent in a well-designed database. A
complete discussion of normalization rules is beyond the scope of this short
article. However, the following few rules can get you a good portion of the way to
achieving a sound database design:
A table should have an identifier.
The fundamental rule of database design theory is that each table should have a
unique row identifier, a column or set of columns used to distinguish any single
record from every other record in the table. Each table should have an ID
column, and no two records can share the same ID value. The column or columns
serving as the unique row identifier for a table is the primary key of the table.
Tables are typically given an integer IDENTITY column as the primary key, although more and more developers are now using UNIQUEIDENTIFIER (GUID) keys, a technique that has some issues related to performance and index fragmentation because of the sheer physical size (16 bytes) of the GUID structure. With an integer IDENTITY, you can have up to ~4 billion rows in a table if you start the seed at the most negative integer value.
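A minimal sketch of that seeding technique (table name invented for illustration):

    -- Starting the IDENTITY seed at the minimum 32-bit INT value makes
    -- the full ~4 billion values of the INT range available as keys.
    CREATE TABLE EventLog (
        EventID INT IDENTITY(-2147483648, 1) PRIMARY KEY,
        Message VARCHAR(400) NOT NULL
    );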
A table should store only data for a single type of entity.
Attempting to store too much information in a table can prevent the efficient and
reliable management of the data in the table. For example, in the pubs
sample database in SQL Server, the titles and publishers information is stored in
two separate tables. Although it is possible to have columns that contain
information for both the book and the publisher in the titles table, this design is
problematic. The publisher information must be added and stored redundantly
for each book published by a publisher. This uses extra storage space in
the database. If the address for the publisher changes, the change must be
made for each book. And if the last book for a publisher is removed from the titles table, the information for that publisher is lost. Separating the repeating groups of publisher data from the title information results in first normal form (1NF).
In the pubs database, with the information for books and publishers stored in the titles and publishers tables, the information about the publisher has to be entered only once and then linked to each book. Therefore, if the publisher information is changed, it must be changed in only one place, and the publisher information will still be there even if the publisher has no books in the database.
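Reduced to a sketch, that design looks roughly like this (the real pubs tables carry more columns than shown here):

    -- Publisher information is stored exactly once...
    CREATE TABLE publishers (
        pub_id   CHAR(4)     PRIMARY KEY,
        pub_name VARCHAR(40),
        city     VARCHAR(20),
        state    CHAR(2)
    );

    -- ...and each title simply points at its publisher.
    CREATE TABLE titles (
        title_id CHAR(6)     PRIMARY KEY,
        title    VARCHAR(80) NOT NULL,
        pub_id   CHAR(4)     REFERENCES publishers(pub_id)
    );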
When you are first designing your application, don't automatically assume you have to de-normalize your data. Your first goal in the design process should be to normalize your data. Later, you can test your design with realistic data and transactions. At that point, if you see that de-normalization will help, you are in a much better position to make that decision.
Always specify the narrowest columns you can. The narrower the column, the less data SQL Server has to store, and the faster SQL Server is able to read and write data. For example, if you know that you will never need to store Unicode in a text column, make it type VARCHAR, not NVARCHAR. NVARCHAR stores each character in two bytes, even if it is in the ASCII character set. If you have an integer column that you know will never hold a value outside the range 0 to 255, use TINYINT rather than INT. Columns with the tinyint data type use only one byte to store their values, compared with the two, four, and eight bytes used by the smallint, int, and bigint data types. When you add all these little optimizations up, on a table with many thousands of rows, it starts to make a difference performance-wise. The engineers who developed SQL Server and other RDBMS products put all these different data types there for a reason: to allow you to optimize performance.
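A few illustrative column choices along those lines (table and column names are hypothetical):

    -- Narrow types wherever the data's domain allows it.
    CREATE TABLE ProductSummary (
        ProductID    INT IDENTITY(1,1) PRIMARY KEY,
        SKU          VARCHAR(20) NOT NULL, -- no Unicode needed: VARCHAR, not NVARCHAR
        QtyPerCarton TINYINT     NOT NULL, -- always 0-255: one byte instead of four
        UnitsSold    SMALLINT    NOT NULL  -- fits comfortably in two bytes
    );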
A table should avoid nullable columns.
Tables can have columns defined to allow null values. A null value indicates that
there is no value. Although it can be useful to allow null values in isolated cases,
it is best to use them sparingly because they require special handling that
increases the complexity of data operations. For best performance, and to reduce bugs, columns should ideally be defined as NOT NULL. For example, use of the IS NULL keywords in the WHERE clause can keep that portion of the query from using an index efficiently. One simple way to avoid nullable columns is to declare them NOT NULL and provide a default value when creating the column.
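For example, reusing the hypothetical Posts table from earlier:

    -- NOT NULL plus a DEFAULT means no query ever has to test for NULL.
    ALTER TABLE Posts
    ADD ViewCount INT NOT NULL DEFAULT 0;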
Second Normal Form
A table should not have repeating values or columns.
The table for an item in the database should not contain a list of values for a
specific piece of information. For example, a book in the pubs database might be
coauthored. If there is a column in the titles table for the name of the author, this
presents a problem. One solution is to store the name of both authors in the
column, but this makes it difficult to show a list of the individual authors. Another
solution is to change the structure of the table to add another column for the
name of the second author, but this accommodates only two authors. Yet
another column must be added if a book has three authors.
If you find that you need to store a list of values in a single column, or if you have multiple columns for a single piece of data (au_lname1, au_lname2, and so on), you should consider placing the duplicated data in another table with a link back to the primary table. The pubs database has a table for book information and another junction (aka associative) table that stores only the ID values for the books and the IDs of the authors of the books. This design allows any number of authors for a book without modifying the definition of the table, and allocates no unused storage space for books with a single author. This represents Second Normal Form (2NF).
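A reduced sketch of such a junction table, loosely following the shape of the pubs titleauthor table (the real table carries a couple of extra columns):

    -- One row per author-per-book; any number of authors per title,
    -- and no wasted columns for single-author books.
    CREATE TABLE titleauthor (
        au_id    VARCHAR(11) NOT NULL REFERENCES authors(au_id),
        title_id CHAR(6)     NOT NULL REFERENCES titles(title_id),
        PRIMARY KEY (au_id, title_id)
    );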
Third Normal Form
In order to be in Third Normal Form, a relation must first fulfill the requirements of Second Normal Form. Recall the general requirements of 2NF: remove subsets of data that apply to multiple rows of a table and place them in separate tables, and create relationships between these new tables and their predecessors through the use of foreign keys. Third Normal Form additionally requires that all attributes (columns) not directly dependent upon the primary key be removed to a separate table.
Example: The following table is NOT in Third Normal Form:
Company          City      State  ZIP
---------------  --------  -----  -----
Acme Widgets     New York  NY     10169
ABC Corporation  Miami     FL     33196
XYZ, Inc.        Columbia  MD     21046
In this example, the city and state are dependent upon the ZIP code. To place
this table in 3NF, two separate tables would be created -- one containing the
company name and ZIP code and the other containing city, state, ZIP code
pairings.
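A sketch of that decomposition (types and key choices here are illustrative):

    -- The ZIP-to-city/state dependency gets its own lookup table...
    CREATE TABLE CityStateZip (
        Zip   CHAR(5)     PRIMARY KEY,
        City  VARCHAR(40) NOT NULL,
        State CHAR(2)     NOT NULL
    );

    -- ...and each company row carries only the ZIP code.
    CREATE TABLE Company (
        CompanyName VARCHAR(60) PRIMARY KEY,
        Zip         CHAR(5)     NOT NULL REFERENCES CityStateZip(Zip)
    );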
This may seem overly complex for daily applications and in some cases it may
be. Database designers should always keep in mind the tradeoffs between
higher level normal forms and the resource issues that complexity creates.
Why Eliminate Columns Not Dependent On Key?
The motivation for this is the same as for second normal form: we want to avoid update and delete anomalies. For example, suppose company ABC's record were deleted from the original single-table design. Not only would we lose the record of company ABC's existence, we would also lose the city and state pairing for ZIP code 33196. By having a separate Company table that carries only the ZIP, we eliminate this potential issue and streamline our CityStateZip table.
There is a lot more to study about database normalization; this represents only the very basics. As databases grow in size, a sound knowledge of database normalization rules, and the rationale behind them, becomes critically important to your development success.

Conclusion
The biggest risk developers face in designing databases is what I call the "SDD Syndrome" (Sloppy Database Design). This is when you throw together anything at all because you are in a hurry, or you are just naive and uninformed, and it ends up in production. Before too long, design refactoring and normalization become so painful a process that they never get done. Please, don't do this! Take the time to think your database design through carefully; in many cases it truly represents the very core of your application.
When you follow standard database normalization recommendations in designing databases, you can maximize a database's performance. The main reasons why:
It helps reduce the total amount of redundant data in the database. The less data
there is, the less work SQL Server has to do, speeding up performance.
It helps to reduce the use of NULLs in the database. The use of NULLs in a database can reduce database performance, especially in WHERE clauses.
It reduces the number of columns in tables, which means that more rows can fit on a single data page, helping to boost SQL Server read performance.
It helps to reduce the amount of Transact-SQL code that needs to be written to
deal with non-normalized data. The less code there is, the less that has to run,
speeding your application's performance.
It helps maximize the use of clustered indexes, the most powerful and useful
type of index available to SQL Server. The more data is separated into multiple
tables because of normalization, the more clustered indexes become available to
help speed up data access.
It helps reduce the total number of indexes in your database. The fewer columns tables have, the less need there is for multiple indexes to retrieve the data. The fewer indexes there are, the smaller the negative performance effect on INSERTs, UPDATEs, and DELETEs.
