My informal poll of fellow developers reveals that by about a 4 to 1 margin, we are
all using UNIQUEIDENTIFIER primary keys. Not a good idea, usually. In fact, only
one poll respondent even mentioned anything about NEWSEQUENTIALID(), which decreases
fragmentation and improves performance.
The GUID (UNIQUEIDENTIFIER) will be the page pointer for each leaf page in any non-clustered
index. About the only thing you can do to improve the performance of this is
to increase the Fill Factor to 100% and rebuild your indexes frequently.
When a GUID is used instead of an integer identity column then the GUID need to be
matched for each row that is returned using that column in the where clause.
If a high volume of inserts are done on these tables then the GUIDs, being large,
will contribute to page splits, as will the fact that NEWID() generates a random
value, which could place a new record on any of the data pages. The result? Performance
problems.
A good clustering key should be sequential. But, a GUID that is not sequential -
like one that has it's value generated in the client or generated by the newid()
function in SQL Server can be a bad choice - mostly because of the fragmentation
that it creates in the base table but also because of its size. It's unnecessarily
wide -- 4 times wider than an int-based identity. Think about it: an int will
give you 2 billion unique rows (4 billion if you use negative numbers). If you
need more than 2 billion you can always go with bigint (8-byte int) and get 2^63-1
rows. You could use the new NEWSEQUENTIALID() function to improve things, but
you're still dealing with 16 bytes of glop. NEWSEQUENTIALID() as the default
value for a PK column will prevent undue fragmentation because it really is "sequential".
Kimberly Tripp, who writes extensively on SQL Server and knows a lot more about this
stuff than I do, did some comparison estimates on database size:
- Base Table with 1,000,000 rows (3.8MB for int vs. 15.26MB with UNIQUEIDENTIFIER)
- 6 nonclustered indexes (22.89MB for int vs. 91.55MB with UNIQUEIDENTIFIER)
What we're looking at is 25MB vs 106MB; this is just for 1 million rows and it really
represents only overhead. The fragmentation costs you even more in wasted space
and time because of splitting. When you have fragmentation then you need a solid
maintenance plan, which has it's own costs and potential for downtime. And, disk
space I/O is not free – it takes longer to both read and write 16 bytes than
it does for 4 bytes.
If you’re stuck with Guids generated by NewID() (or custom generated at the client)
you should consider periodically rebuilding your indexes, say, weekly. Here is
a script that will automatically reindex all tables:
USE DatabaseName --Enter the name of the database you want to reindex
DECLARE @TableName varchar(255)
DECLARE TableCursor CURSOR FOR
SELECT table_name FROM information_schema.tables
WHERE table_type = 'base table'
OPEN TableCursor
FETCH NEXT FROM TableCursor INTO @TableName
WHILE @@FETCH_STATUS = 0
BEGIN
DBCC DBREINDEX(@TableName,' ',90)
FETCH NEXT FROM TableCursor INTO @TableName
END
CLOSE TableCursor
DEALLOCATE TableCursor
The above script uses a fill factor of 90%, but only experimentation will tell you
what the optimal fill factor is for your particular database schema. You can
use SqlProfiler to do tests.
I did some informal tests. I created a test table with an int primary key (identity),
then one using newid() (UNIQUEIDENTIFIER), and finally one with NEWSEQUENTIALID(),
and I inserted 50,000 rows in each.
The int table inserts took 10 seconds, with 41% fragmentation.
The newId() table inserts took 21 seconds, with 99% fragmentation.
The NewSequentialId() table inserts took 10 seconds (same as int), but with 75% fragmentation.
The times I report include the time to create the table as well as to obtain some
statistics.
With NEWID() your fragmentation can be so bad that SQL Server is going to have to
hop all over the place to find your rows on a select.
It's better with NEWSEQUENTIALID() because there is less fragmentation, but you are
still dealing with 16 byte width. This means your data pages will be bigger and
operations will be slower by comparison. This is especially true when the number
of rows increases.
The int PK IDENTITY had only 41% fragmentation which not only meant faster inserts,
but much faster selects as well.
NEWSEQUENTIALID() will generate fewer pages than the NEWID() approach but it still
generates more pages than the IDENTITY approach. That's because the uniqueidentifier
data type consumes 16 bytes of disk space as opposed to the 4 bytes used by the
integer data type that was used for the IDENTITY. Considering that SQL Server
pages are generally capped at 8K or roughly 8060 bytes, this leads to more pages
generated for the NEWSEQUENTIALID() approach as opposed to the IDENTITY approach.
So what can you do if you're stuck with an ever - growing table that has a GUID primary
key and you're starting to see performance issues on inserts? One thing you can
do is take away the primary key and default clustered index on the GUID column
and just use it as a unique "Marker" column. You would then add a new
column of type int or bigint, and make this the new Primary key. Your inserts
can be much faster, and you can still use your GUID column.
If you really need to optimize for fast inserts, then a table with NO clustered index,
NO constraints, NO Foreign Keys, NO Defaults and NO Primary Key, in a database
with Simple Recovery Model, is the best.
There are certainly still situations where using GUIDs as primary keys make sense
(for example, in replication). I don't know about you, but the next time I develop
a SQL Server database schema, I'm going to think long and hard about why I need
to use GUIDS for primary keys. It's a situation that is very difficult to change
once you're in production.