Database Guru: Yes, my son, what brings you to my cave today?
Programmer: Your excellence, I've been hiking through the forest for 20 years to
find you and ask you the meaning of database normalization!
Database Guru: My son, database normalization is a fountain!
Programmer: What!! I've beaten the soles off my shoes for years, traipsing through
the forests, to hear you say this gibberish?
Database Guru: You mean, database normalization is NOT a fountain?
When I first started using databases, I had no idea what database design was. The
most important thing to me was just, "get it to work!". That was a
long time ago; I had a lot to learn -- and I'm still learning. I certainly lay
no claim to being a Database Guru -- in fact, I really kind of hate databases,
to be perfectly frank. The NOSQL database movement is much more interesting to
me. However, if you want to be a professional developer, you really have no
choice but to get real comfy with relational databases. I believe I do understand
the basics of good database design and normalization, and what I intend to do
here is present key information on the subject culled from a number of sources,
including SQL Server Books Online, in the most easily understandable form possible.
Database applications work best with normalized databases. In addition, the resulting
applications are more flexible and powerful than those developed against databases
that are not normalized. It's possible to "over normalize" a database
(just as you can "over-OOP" in programming) which can make it difficult
to adapt it to changing requirements. As with anything in life, there is a happy
medium, but first you have to know and understand the rules. Then you'll be able
to apply them sensibly.
Normalization
The logical design of the database, including the tables and the relationships between
them, is the real core of an optimized relational database. A good logical database
design lays the foundation for optimal database and application performance.
A poor logical database design can impair the performance of the entire system.
Let's say you have a table that holds many thousands of records for your messageboard
posts, for example, and you decide you want to add a rating system so readers
can "rate" each messageboard post from 1 to 10 stars. Although most
of the extra columns of data about ratings aren't directly specific to the process
of finding and displaying a messageboard post, it's conceivable you could make
the case that you want to add them to the posts table because it is "more
convenient". But, from a performance standpoint, it's more likely you've
shot yourself in the foot. There will likely be many rows for which nobody ever
submitted a rating, or which don't qualify to be rated. This poor design becomes
especially pronounced the more rows of data there are in the table. What you
really want is to have that table hold only the exact required columns of non-repeating
data necessary to provide what you need for its primary purpose (displaying messageboard
posts), and to hold that data in the narrowest columns possible in which that
data will fit (more on this later). In the normalization process, you would move
all the rating information to a separate table, linked back to the main table
by the post id. You've achieved the goal of adding your rating feature, but you
haven't had to add a single column to the main table, which needs to be optimized
for speed. And your new ratings table now only holds records for those posts
that actually have a rating, conserving more space. When the rating information
matching rows in the main table is needed, you bring it in with a SQL JOIN. This
is what good logical design is all about.
Normalizing a logical database design involves using formal methods to separate the
data into multiple, related tables. A greater number of narrow tables (with fewer
columns) is characteristic of a normalized database. Fewer wide tables (with
more columns) is characteristic of a non-normalized database. Databases are like
real estate - it's location, location, location! If you have a table with a large
number of columns that really ought to be in another table, linked to the first
with a foreign key, somebody is going to pay the price for all that extra real
estate you've taken up. You'll see performance suffer on selects, updates and
on inserts - for several reasons which are explained farther down the page. It's
important to understand that even if there is a one-to-one relationship among
all the columns in your table, if the extra columns aren't needed for the primary
task, you have forced yourself to pay for more real estate in your select statements
because you've chosen to make the table physically wider (more columns) than
what is needed. This happens even though your select statement may not bring
back those extra columns of data at all. So if these extra columns don't directly
relate to the primary purpose of the table (in this case, to display messageboard
posts) then, by normalizing them out and reducing the number of columns in the
primary table, more rows can fit on a single SQL data page, improving performance.
SQL Server Page access on the disk is as objective as gravity: if the stuff you want
is better organized on the physical media (the MDF file) it stands to reason
that you'll get it back faster. And that's exactly what database optimization
and normalization does. We had a forum post recently from someone who was attempting
to figure out why certain queries took much longer than others. His database
has 72 million rows of 8 columns of data. That's a candidate for a completely
different kind of normalization and optimization - federated partitioned views,
where the table is actually distributed across two or more physical database
servers, yet appears as a contiguous monolithic view to the caller.
Reasonable normalization usually improves performance -- sometimes very significantly.
When useful indexes are available, the Microsoft SQL Server 2005 / 2008 query
optimizer is most efficient at selecting rapid, efficient joins between tables.
Some of the benefits of normalization include:
Faster sorting and index creation.
A larger number of clustered indexes.
Narrower and more compact indexes.
Fewer indexes per table, which improves the performance of INSERT, UPDATE, and DELETE
statements.
Fewer null values and less opportunity for inconsistency, which increase database
compactness.
As normalization increases, so do the number and complexity of joins required to
retrieve data. Too many complex relational joins between too many tables can
begin to hinder performance. Reasonable normalization typically includes few
regularly executed queries that use joins that involve more than four tables.
Sometimes you "inherit" a logical database design that is already fixed
-- and total redesign is not feasible. Even then, however, it might be possible
to normalize a large table selectively into several smaller tables. If the database
is accessed through stored procedures, this schema change could take place without
affecting applications at all. If not, you might be able to create a view that
hides the schema change from the applications.
There are at least eight (8) well-defined rules for database normalization, including
1NF,2NF,3NF,BCNF, 4NF, 5NF, ONF and DKNF. However in the normal operation of
database design for developers, the first three are the most important:
1NF - Eliminate Repeating Groups - Make a separate table for each set of related attributes,
and give each table a primary key.
2NF - Eliminate Redundant Data - If an attribute depends on only part of a multi-valued
key, remove it to a separate table.
3NF - Eliminate Columns Not Dependent on Key - If attributes do not contribute to a description
of the key, remove them to a separate table.
Achieving a Well-Designed Database
In relational-database design theory, normalization rules identify certain attributes
that must be present or absent in a well-designed database. A complete discussion
of normalization rules is beyond the scope of this short article. However, the
following few rules can get you a good portion of the way to achieving a sound
database design:
A table should have an identifier.
The fundamental rule of database design theory is that each table should have a unique
row identifier, a column or set of columns used to distinguish any single record
from every other record in the table. Each table should have an ID column, and
no two records can share the same ID value. The column or columns serving as
the unique row identifier for a table is the primary key of the table. Tables
are typically provided with an integer IDENTITY attribute as the primary key,
although more and more developers are now using UNIQUEIDENTIFIER (GUID) keys,
a technique that has some issues related to performance and fragmentation because
of the sheer physical size of the GUID structure. With an integer IDENTITY, you
can have up to ~4 Billion rows in a table if you start the seed at the most negative
integer number.
A table should store only data for a single type of entity.
Attempting to store too much information in a table can prevent the efficient and
reliable management of the data in the table. For example, in the pubs sample
database in SQL Server, the titles and publishers information is stored in two
separate tables. Although it is possible to have columns that contain information
for both the book and the publisher in the titles table, this design is problematic.
The publisher information must be added and stored redundantly for each book
published by a publisher. This uses extra storage space in the database. If the
address for the publisher changes, the change must be made for each book. And
if the last book for a publisher is removed from the title table, the information
for that publisher is lost. Separating the repeating groups of publishers from
the title information results in first normal form (1NF).
In the pubs database, with the information for books and publishers stored in the
titles and publishers tables, the information about the publisher has to be entered
only once and then linked to each book. Therefore, if the publisher information
is changed, it must be changed in only one place, and the publisher information
will be there even if the publisher has no books in the database.
When you are first designing your application, don't automatically assume you have
to de-normalize your data. Your first goal in the design process should be to
normalize your data. Later you can test your design with realistic data and transactions.
At this point if you see that de-normalization will help, then you are in a much
better position to make that decision. Always specify the narrowest columns you
can. The narrower the column, the less amount of data SQL Server has to store,
and the faster SQL Server is able to read and write data. For example, if you
know that you will never need to store Unicode in a text column, make it type
VARCHAR, not NVARCHAR. NVARCHAR stores each character in two bytes, even if it
is in the ASCII character set. If you have an integer column that you know will
never hold a very high value (not higher than 255), use TinyInt rather than Int.
The columns with tinyint data type use only one byte to store their values, in
comparison with two bytes, four bytes and eight bytes used to store the columns
with smallint, int and bigint data types. When you add all these little optimizations
up, on a table with many thousands of rows, it starts to make a difference performance-wise.
The engineers who developed SQL Server and other RDBMs systems put all these
different data types there for a reason - to allow you to optimize performance.
A table should avoid nullable columns.
Tables can have columns defined to allow null values. A null value indicates that
there is no value. Although it can be useful to allow null values in isolated
cases, it is best to use them sparingly because they require special handling
that increases the complexity of data operations. For best performance, and to
reduce bugs, columns should ideally be set to NOT NULL. For example, use of the
IS NULL keywords in the WHERE clause makes that portion of the query unable to
use an index. One simple way to avoid nullable columns is to set them as NOT
NULL and provide a default value when creating the column.
Second Normal Form
A table should not have repeating values or columns.
The table for an item in the database should not contain a list of values for a specific
piece of information. For example, a book in the pubs database might be coauthored.
If there is a column in the titles table for the name of the author, this presents
a problem. One solution is to store the name of both authors in the column, but
this makes it difficult to show a list of the individual authors. Another solution
is to change the structure of the table to add another column for the name of
the second author, but this accommodates only two authors. Yet another column
must be added if a book has three authors.
If you find that you need to store a list of values in a single column, or if you
have multiple columns for a single piece of data (au_lname1, au_lname2, and so
on), you should consider placing the duplicated data in another table with a
link back to the primary table. The pubs database has a table for book information
and another junction (aka associative) table that stores only the ID values for
the books and the IDs of the authors of the books. This design allows any number
of authors for a book without modifying the definition of the table and allocates
no unused storage space for books with a single author. This represents Second
Normal Form (2NF).
Third Normal Form
In order to be in Third Normal Form, a relation must first fulfill the requirements
to be in Second Normal Form . Recall the general requirements of 2NF:
Remove subsets of data that apply to multiple rows of a table and place them in separate
tables. Create relationships between these new tables and their predecessors
through the use of foreign keys. Additionally, all attributes (columns) that
are not dependent upon the primary key must be eliminated.
Example: The following table is NOT in Third Normal Form:
Company City State ZIP
Acme Widgets New York NY 10169
ABC Corporation Miami FL 33196
XYZ, Inc. Columbia MD 21046
In this example, the city and state are dependent upon the ZIP code. To place this
table in 3NF, two separate tables would be created -- one containing the company
name and ZIP code and the other containing city, state, ZIP code pairings.
This may seem overly complex for daily applications and in some cases it may be. Database
designers should always keep in mind the tradeoffs between higher level normal
forms and the resource issues that complexity creates.
Why Eliminate Columns Not Dependent On Key?
The motivation for this is the same for second normal form: we want to avoid update
and delete anomalies. For example, suppose no records from company ABC were currently
stored in the database. With the original single-table design, there would be
no record of company ABC's existence. By having a separate Company table with
a field for the Zip, we eliminate this potential issue and streamline our CityStateZip
table.
You can study a lot more about database normalization, this represents the very basics.
As databases grow in size, a sound knowledge of database normalization rules
and the rationale behind them becomes critically important to your development
success.
Conclusion
The biggest risk developers face in designing database is what I call the "SDD
Syndrome" (Sloppy Database Design). This is when you throw together anything
at all because you are in a hurry or you are just naive and uninformed, and it
ends up in production. Before too long, design refactoring and normalization
become so painful a process that they never get done. Please, don't do this!
Take the time to think your database design through carefully; in many cases
it truly represents the very core of your application.
When you follow standard database normalization recommendations in designing databases
you can maximize a database's performance. The main reasons why:
It helps reduce the total amount of redundant data in the database. The less data
there is, the less work SQL Server has to do, speeding up performance.
It helps to reduce the use of NULLS in the database. The use of NULLs in a database
can reduce database performance, especially in WHERE clauses.
It reduces the number of columns in tables, which means that more rows can fit on
a single data page, which helps to boost SQL Server read performance.
It helps to reduce the amount of Transact-SQL code that needs to be written to deal
with non-normalized data. The less code there is, the less that has to run, speeding
your application's performance.
It helps maximize the use of clustered indexes, the most powerful and useful type
of index available to SQL Server. The more data is separated into multiple tables
because of normalization, the more clustered indexes become available to help
speed up data access.
It helps reduce the total number of indexes in your database. The less columns tables
have, the less need there is for multiple indexes to retrieve it. The fewer indexes
there are, the less negative is the performance effect of INSERTs, UPDATES, and
DELETES.