SQL Server Database Normalization Basics for Developers

This article covers the basics of database normalization and why it is important, focusing primarily on SQL Server.

Database Guru: Yes, my son, what brings you to my cave today?
Programmer: Your excellence, I've been hiking through the forest for 20 years to find you and ask you the meaning of database normalization!
Database Guru: My son, database normalization is a fountain!
Programmer: What!! I've beaten the soles off my shoes for years, traipsing through the forests, to hear you say this gibberish?
Database Guru: You mean, database normalization is NOT a fountain?


When I first started using databases, I had no idea what database design was. The most important thing to me was just, "get it to work!" That was a long time ago; I had a lot to learn -- and I'm still learning. I certainly lay no claim to being a Database Guru -- in fact, to be perfectly frank, I kind of hate databases. The NoSQL database movement is much more interesting to me. However, if you want to be a professional developer, you really have no choice but to get real comfy with relational databases. I believe I understand the basics of good database design and normalization, and what I intend to do here is present key information on the subject, culled from a number of sources including SQL Server Books Online, in the most easily understandable form possible.

Database applications work best with normalized databases. In addition, the resulting applications are more flexible and powerful than those developed against databases that are not normalized. It's possible to "over-normalize" a database (just as you can "over-OOP" in programming), which can make it difficult to adapt to changing requirements. As with anything in life, there is a happy medium, but first you have to know and understand the rules. Then you'll be able to apply them sensibly.

Normalization

The logical design of the database, including the tables and the relationships between them, is the real core of an optimized relational database. A good logical database design lays the foundation for optimal database and application performance. A poor logical database design can impair the performance of the entire system.

Let's say you have a table that holds many thousands of records for your messageboard posts, and you decide you want to add a rating system so readers can "rate" each post from 1 to 10 stars. Although most of the extra columns of rating data aren't directly specific to the process of finding and displaying a messageboard post, you could conceivably make the case for adding them to the posts table because it is "more convenient". But from a performance standpoint, you've more likely shot yourself in the foot. There will probably be many rows for which nobody ever submitted a rating, or which don't qualify to be rated. This poor design becomes more pronounced the more rows of data there are in the table. What you really want is for that table to hold only the exact columns of non-repeating data necessary for its primary purpose (displaying messageboard posts), and to hold that data in the narrowest columns in which it will fit (more on this later). In the normalization process, you would move all the rating information to a separate table, linked back to the main table by the post id. You've achieved the goal of adding your rating feature, but you haven't had to add a single column to the main table, which needs to be optimized for speed. And your new ratings table holds records only for those posts that actually have a rating, conserving space. When the rating information for rows in the main table is needed, you bring it in with a SQL JOIN. This is what good logical design is all about.
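Here is a minimal T-SQL sketch of that split. The table and column names are hypothetical, invented purely for illustration:

    -- Narrow main table, optimized for its primary purpose
    CREATE TABLE Posts (
        PostID    INT IDENTITY PRIMARY KEY,
        Title     VARCHAR(200) NOT NULL,
        Body      VARCHAR(MAX) NOT NULL,
        PostedOn  DATETIME NOT NULL DEFAULT GETDATE()
    );

    -- Ratings live in their own table; rows exist only for posts actually rated
    CREATE TABLE Ratings (
        RatingID  INT IDENTITY PRIMARY KEY,
        PostID    INT NOT NULL REFERENCES Posts(PostID),
        Stars     TINYINT NOT NULL CHECK (Stars BETWEEN 1 AND 10)
    );

    -- Bring the rating data in with a JOIN only when it's needed
    SELECT p.PostID, p.Title, AVG(CAST(r.Stars AS DECIMAL(4,2))) AS AvgStars
    FROM Posts p
    JOIN Ratings r ON r.PostID = p.PostID
    GROUP BY p.PostID, p.Title;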

Normalizing a logical database design involves using formal methods to separate the data into multiple, related tables. A greater number of narrow tables (with fewer columns) is characteristic of a normalized database. Fewer wide tables (with more columns) is characteristic of a non-normalized database. Databases are like real estate - it's location, location, location! If you have a table with a large number of columns that really ought to be in another table, linked to the first with a foreign key, somebody is going to pay the price for all that extra real estate you've taken up. You'll see performance suffer on selects, updates, and inserts - for several reasons, which are explained further down the page. It's important to understand that even if there is a one-to-one relationship among all the columns in your table, if the extra columns aren't needed for the primary task, you have forced yourself to pay for more real estate in your select statements because you've chosen to make the table physically wider (more columns) than it needs to be. This happens even though your select statement may not bring back those extra columns of data at all. So if these extra columns don't directly relate to the primary purpose of the table (in this case, to display messageboard posts), then by normalizing them out and reducing the number of columns in the primary table, more rows can fit on a single SQL data page, improving performance.

SQL Server page access on disk is as objective as gravity: if the stuff you want is better organized on the physical media (the MDF file), it stands to reason that you'll get it back faster. And that's exactly what database optimization and normalization do. We had a forum post recently from someone who was attempting to figure out why certain queries took much longer than others. His database had 72 million rows of 8 columns of data. That's a candidate for a completely different kind of normalization and optimization - federated partitioned views, where the table is actually distributed across two or more physical database servers, yet appears as a contiguous, monolithic view to the caller.
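For the curious, here is a simplified sketch of a local partitioned view; in the federated case, the member tables would live on different servers and be referenced through linked-server (four-part) names. All names and key ranges here are hypothetical:

    -- Each member table holds one key range, enforced by a CHECK
    -- constraint on the partitioning column
    CREATE TABLE Posts_Range1 (
        PostID INT NOT NULL PRIMARY KEY CHECK (PostID BETWEEN 1 AND 9999999),
        Title  VARCHAR(200) NOT NULL
    );
    CREATE TABLE Posts_Range2 (
        PostID INT NOT NULL PRIMARY KEY CHECK (PostID BETWEEN 10000000 AND 19999999),
        Title  VARCHAR(200) NOT NULL
    );
    GO
    -- The view presents the pieces to the caller as one contiguous table
    CREATE VIEW AllPosts AS
        SELECT PostID, Title FROM Posts_Range1
        UNION ALL
        SELECT PostID, Title FROM Posts_Range2;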

Reasonable normalization usually improves performance -- sometimes very significantly. When useful indexes are available, the Microsoft SQL Server 2005/2008 query optimizer is efficient at selecting fast, effective joins between tables.

Some of the benefits of normalization include:

Faster sorting and index creation.
A larger number of clustered indexes.
Narrower and more compact indexes.
Fewer indexes per table, which improves the performance of INSERT, UPDATE, and DELETE statements.
Fewer null values and less opportunity for inconsistency, which increase database compactness.

As normalization increases, so do the number and complexity of the joins required to retrieve data. Too many complex joins between too many tables can begin to hinder performance. In a reasonably normalized design, few regularly executed queries should need joins involving more than four tables.

Sometimes you "inherit" a logical database design that is already fixed -- and total redesign is not feasible. Even then, however, it might be possible to normalize a large table selectively into several smaller tables. If the database is accessed through stored procedures, this schema change could take place without affecting applications at all. If not, you might be able to create a view that hides the schema change from the applications.

There are at least eight (8) well-defined rules for database normalization, including 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, ONF, and DKNF. However, in the day-to-day database design work that developers do, the first three are the most important:

1NF - Eliminate Repeating Groups - Make a separate table for each set of related attributes, and give each table a primary key.
2NF - Eliminate Redundant Data - If an attribute depends on only part of a multi-valued key, remove it to a separate table.
3NF - Eliminate Columns Not Dependent on Key - If attributes do not contribute to a description of the key, remove them to a separate table.

Achieving a Well-Designed Database

In relational-database design theory, normalization rules identify certain attributes that must be present or absent in a well-designed database. A complete discussion of normalization rules is beyond the scope of this short article. However, the following few rules can get you a good portion of the way to achieving a sound database design:

A table should have an identifier.

The fundamental rule of database design theory is that each table should have a unique row identifier: a column or set of columns used to distinguish any single record from every other record in the table. Each table should have an ID column, and no two records can share the same ID value. The column or columns serving as the unique row identifier for a table are the primary key of the table. Tables are typically given an integer IDENTITY column as the primary key, although more and more developers are now using UNIQUEIDENTIFIER (GUID) keys, a technique that carries some performance and fragmentation costs because of the sheer physical size (16 bytes) of the GUID structure. With an integer IDENTITY, you can have up to ~4.29 billion rows in a table if you start the seed at the most negative integer value.
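As a sketch (the table name is hypothetical), seeding the IDENTITY at the most negative INT value makes the full range of the 4-byte integer available:

    -- INT ranges from -2,147,483,648 to 2,147,483,647, so seeding at the
    -- minimum yields roughly 4.29 billion possible key values
    CREATE TABLE ForumPost (
        PostID INT IDENTITY(-2147483648, 1) PRIMARY KEY,
        Title  VARCHAR(200) NOT NULL
    );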

A table should store only data for a single type of entity.

Attempting to store too much information in a table can prevent the efficient and reliable management of the data in the table. For example, in the pubs sample database in SQL Server, the titles and publishers information is stored in two separate tables. Although it is possible to have columns that contain information for both the book and the publisher in the titles table, this design is problematic. The publisher information would have to be added and stored redundantly for each book published by a publisher. This uses extra storage space in the database. If the address for the publisher changes, the change must be made for each book. And if the last book for a publisher is removed from the titles table, the information for that publisher is lost. Separating the repeating groups of publishers from the title information results in first normal form (1NF).

In the pubs database, with the information for books and publishers stored in the titles and publishers tables, the information about the publisher has to be entered only once and then linked to each book. Therefore, if the publisher information is changed, it must be changed in only one place, and the publisher information will be there even if the publisher has no books in the database.
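In pubs, the two tables are linked on the pub_id column, so the publisher details are stored once and joined in only when needed:

    -- One publishers row serves every related titles row
    SELECT t.title, p.pub_name, p.city, p.state
    FROM titles t
    JOIN publishers p ON p.pub_id = t.pub_id;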

When you are first designing your application, don't automatically assume you have to de-normalize your data. Your first goal in the design process should be to normalize your data. Later, you can test your design with realistic data and transactions. At that point, if you see that de-normalization will help, you are in a much better position to make the decision. Always specify the narrowest columns you can. The narrower the column, the less data SQL Server has to store, and the faster SQL Server is able to read and write it. For example, if you know that you will never need to store Unicode in a text column, make it type VARCHAR, not NVARCHAR. NVARCHAR stores each character in two bytes, even if it is in the ASCII character set. If you have an integer column that you know will never hold a value outside the range 0 to 255, use TINYINT rather than INT. TINYINT columns use only one byte to store their values, compared with the two, four, and eight bytes used by SMALLINT, INT, and BIGINT columns respectively. When you add all these little optimizations up, on a table with many thousands of rows, it starts to make a difference performance-wise. The engineers who developed SQL Server and other RDBMSs put all these different data types there for a reason - to allow you to optimize performance.
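A short illustrative table definition (the names are made up) showing those narrow-type choices:

    CREATE TABLE Product (
        ProductID    INT IDENTITY PRIMARY KEY,
        Code         VARCHAR(10) NOT NULL,  -- no Unicode needed: 1 byte per character instead of NVARCHAR's 2
        StockLevel   TINYINT NOT NULL,      -- 0 to 255 fits in 1 byte instead of INT's 4
        ReorderPoint SMALLINT NOT NULL      -- 2 bytes covers -32,768 to 32,767
    );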

A table should avoid nullable columns.

Tables can have columns defined to allow null values. A null value indicates that there is no value. Although it can be useful to allow null values in isolated cases, it is best to use them sparingly because they require special handling that increases the complexity of data operations. For best performance, and to reduce bugs, columns should ideally be set to NOT NULL. Nullable columns also complicate your queries: every comparison has to account for the NULL case, and wrapping columns in functions such as ISNULL() in the WHERE clause can prevent SQL Server from using an index efficiently for that portion of the query. One simple way to avoid nullable columns is to declare them NOT NULL and provide a default value when creating the column.
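A sketch of that approach, with hypothetical names:

    -- NOT NULL plus a DEFAULT means inserts that omit the column still
    -- succeed, and no NULL handling is ever needed when querying it
    CREATE TABLE Subscriber (
        SubscriberID INT IDENTITY PRIMARY KEY,
        Email        VARCHAR(320) NOT NULL,
        IsActive     BIT NOT NULL DEFAULT 1,
        SignedUpOn   DATETIME NOT NULL DEFAULT GETDATE()
    );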

Second Normal Form

A table should not have repeating values or columns.

The table for an item in the database should not contain a list of values for a specific piece of information. For example, a book in the pubs database might be coauthored. If there is a column in the titles table for the name of the author, this presents a problem. One solution is to store the names of both authors in the column, but this makes it difficult to show a list of the individual authors. Another solution is to change the structure of the table to add another column for the name of the second author, but this accommodates only two authors. Yet another column must be added if a book has three authors.

If you find that you need to store a list of values in a single column, or if you have multiple columns for a single piece of data (au_lname1, au_lname2, and so on), you should consider placing the duplicated data in another table with a link back to the primary table. The pubs database has a table for book information and another junction (aka associative) table that stores only the ID values for the books and the IDs of the authors of the books. This design allows any number of authors for a book without modifying the definition of the table and allocates no unused storage space for books with a single author.  This represents Second Normal Form (2NF).
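In pubs, that junction table is titleauthor, which links titles to authors by their IDs. Listing the authors for each book is then a simple double join:

    -- Any number of authors per book, with no schema change required
    SELECT t.title, a.au_lname, a.au_fname
    FROM titles t
    JOIN titleauthor ta ON ta.title_id = t.title_id
    JOIN authors a ON a.au_id = ta.au_id
    ORDER BY t.title;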

Third Normal Form

In order to be in Third Normal Form, a relation must first fulfill the requirements of Second Normal Form. Recall the general requirements of 2NF: remove subsets of data that apply to multiple rows of a table and place them in separate tables, and create relationships between these new tables and their predecessors through the use of foreign keys. Third Normal Form additionally requires that all attributes (columns) not dependent upon the primary key be eliminated.

Example: The following table is NOT in Third Normal Form:

Company            City        State   ZIP
Acme Widgets       New York    NY      10169
ABC Corporation    Miami       FL      33196
XYZ, Inc.          Columbia    MD      21046

In this example, the city and state are dependent upon the ZIP code. To place this table in 3NF, two separate tables would be created: one containing the company name and ZIP code, and the other containing city, state, and ZIP code pairings.
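A sketch of that decomposition (hypothetical names and types):

    -- ZIP determines city and state, so those live in their own table
    CREATE TABLE CityStateZip (
        ZIP   CHAR(5) PRIMARY KEY,
        City  VARCHAR(50) NOT NULL,
        State CHAR(2) NOT NULL
    );

    -- Each company carries only the ZIP, a foreign key into CityStateZip
    CREATE TABLE Company (
        CompanyName VARCHAR(100) PRIMARY KEY,
        ZIP         CHAR(5) NOT NULL REFERENCES CityStateZip(ZIP)
    );

    -- The original view of the data is recovered with a JOIN
    SELECT c.CompanyName, z.City, z.State, z.ZIP
    FROM Company c
    JOIN CityStateZip z ON z.ZIP = c.ZIP;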

This may seem overly complex for daily applications, and in some cases it may be. Database designers should always keep in mind the tradeoffs between higher-level normal forms and the resource issues that complexity creates.

Why Eliminate Columns Not Dependent On Key?

The motivation for this is the same as for second normal form: we want to avoid update and delete anomalies. For example, suppose no records for company ABC were currently stored in the database. With the original single-table design, there would be no record of company ABC's existence. By having a separate Company table with a field for the ZIP, we eliminate this potential issue and streamline our CityStateZip table.

There is a lot more you can study about database normalization; this represents only the very basics. As databases grow in size, a sound knowledge of database normalization rules, and the rationale behind them, becomes critically important to your development success.

Conclusion

The biggest risk developers face in designing databases is what I call the "SDD Syndrome" (Sloppy Database Design). This is when you throw together anything at all because you are in a hurry, or because you are just naive and uninformed, and it ends up in production. Before too long, design refactoring and normalization become so painful a process that they never get done. Please, don't do this! Take the time to think your database design through carefully; in many cases it truly represents the very core of your application.

When you follow standard database normalization recommendations in designing your databases, you can maximize performance. The main reasons why:

It helps reduce the total amount of redundant data in the database. The less data there is, the less work SQL Server has to do, speeding up performance.
It helps to reduce the use of NULLs in the database. NULLs can reduce database performance, especially in WHERE clauses.
It reduces the number of columns in tables, which means that more rows can fit on a single data page, which helps to boost SQL Server read performance.
It helps to reduce the amount of Transact-SQL code that needs to be written to deal with non-normalized data. The less code there is, the less that has to run, speeding your application's performance.
It helps maximize the use of clustered indexes, the most powerful and useful type of index available to SQL Server. The more data is separated into multiple tables because of normalization, the more clustered indexes become available to help speed up data access.

It helps reduce the total number of indexes in your database. The fewer columns tables have, the less need there is for multiple indexes to retrieve the data. And the fewer indexes there are, the smaller the negative performance impact on INSERTs, UPDATEs, and DELETEs.

By Peter Bromberg