Why GUIDs are not a good idea for SQL Server Primary Keys

The GUID (UNIQUEIDENTIFIER) datatype is a wide column (16 bytes). As the primary key it will be stored in a clustered index (unless specified otherwise). This has performance issues.

My informal poll of fellow developers reveals that by about a 4 to 1 margin, we are all using UNIQUEIDENTIFIER primary keys. Not a good idea, usually. In fact, only one poll respondent even mentioned anything about NEWSEQUENTIALID(), which decreases fragmentation and improves performance.

The GUID (UNIQUEIDENTIFIER) will be the page pointer for each leaf page in any non-clustered index. About the only thing you can do to improve the performance of this is to increase the Fill Factor to 100% and rebuild your indexes frequently.

When a GUID is used instead of an integer identity column then the GUID need to be matched for each row that is returned using that column in the where clause. If a high volume of inserts are done on these tables then the GUIDs, being large, will contribute to page splits, as will the fact that NEWID() generates a random value, which could place a new record on any of the data pages. The result? Performance problems.

A good clustering key should be sequential. But, a GUID that is not sequential - like one that has it's value generated in the client or generated by the newid() function in SQL Server can be a bad choice - mostly because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide -- 4 times wider than an int-based identity. Think about it: an int will give you 2 billion unique rows (4 billion if you use negative numbers). If you need more than 2 billion you can always go with bigint (8-byte int) and get 2^63-1 rows. You could use the new NEWSEQUENTIALID() function to improve things, but you're still dealing with 16 bytes of glop. NEWSEQUENTIALID() as the default value for a PK column will prevent undue fragmentation because it really is "sequential".

Kimberly Tripp, who writes extensively on SQL Server and knows a lot more about this stuff than I do, did some comparison estimates on database size:

  • Base Table with 1,000,000 rows (3.8MB for int vs. 15.26MB with UNIQUEIDENTIFIER)
  • 6 nonclustered indexes (22.89MB for int vs. 91.55MB with UNIQUEIDENTIFIER)

What we're looking at is 25MB vs 106MB; this is just for 1 million rows and it really represents only overhead. The fragmentation costs you even more in wasted space and time because of splitting. When you have fragmentation then you need a solid maintenance plan, which has it's own costs and potential for downtime. And, disk space I/O is not free – it takes longer to both read and write 16 bytes than it does for 4 bytes.

If you’re stuck with Guids generated by NewID() (or custom generated at the client) you should consider periodically rebuilding your indexes, say, weekly. Here is a script that will automatically reindex all tables:

USE DatabaseName --Enter the name of the database you want to reindex
DECLARE @TableName varchar(255)
SELECT table_name FROM information_schema.tables
WHERE table_type = 'base table'
OPEN TableCursor
FETCH NEXT FROM TableCursor INTO @TableName
DBCC DBREINDEX(@TableName,' ',90)
FETCH NEXT FROM TableCursor INTO @TableName
CLOSE TableCursor

The above script uses a fill factor of 90%, but only experimentation will tell you what the optimal fill factor is for your particular database schema. You can use SqlProfiler to do tests.

I did some informal tests. I created a test table with an int primary key (identity), then one using newid() (UNIQUEIDENTIFIER), and finally one with NEWSEQUENTIALID(), and I inserted 50,000 rows in each.

The int table inserts took 10 seconds, with 41% fragmentation.
The newId() table inserts took 21 seconds, with 99% fragmentation.
The NewSequentialId() table inserts took 10 seconds (same as int), but with 75% fragmentation.

The times I report include the time to create the table as well as to obtain some statistics.

With NEWID() your fragmentation can be so bad that SQL Server is going to have to hop all over the place to find your rows on a select.

It's better with NEWSEQUENTIALID() because there is less fragmentation, but you are still dealing with 16 byte width. This means your data pages will be bigger and operations will be slower by comparison. This is especially true when the number of rows increases.

The int PK IDENTITY had only 41% fragmentation which not only meant faster inserts, but much faster selects as well.

NEWSEQUENTIALID() will generate fewer pages than the NEWID() approach but it still generates more pages than the IDENTITY approach. That's because the uniqueidentifier data type consumes 16 bytes of disk space as opposed to the 4 bytes used by the integer data type that was used for the IDENTITY. Considering that SQL Server pages are generally capped at 8K or roughly 8060 bytes, this leads to more pages generated for the NEWSEQUENTIALID() approach as opposed to the IDENTITY approach.

So what can you do if you're stuck with an ever - growing table that has a GUID primary key and you're starting to see performance issues on inserts? One thing you can do is take away the primary key and default clustered index on the GUID column and just use it as a unique "Marker" column. You would then add a new column of type int or bigint, and make this the new Primary key. Your inserts can be much faster, and you can still use your GUID column.

If you really need to optimize for fast inserts, then a table with NO clustered index, NO constraints, NO Foreign Keys, NO Defaults and NO Primary Key, in a database with Simple Recovery Model, is the best.

There are certainly still situations where using GUIDs as primary keys make sense (for example, in replication). I don't know about you, but the next time I develop a SQL Server database schema, I'm going to think long and hard about why I need to use GUIDS for primary keys. It's a situation that is very difficult to change once you're in production.

By Peter Bromberg   Popularity  (10485 Views)