Last week I learned all I ever wanted to know about SQL indexes and then some, as a result of a severe performance issue in our lab system.  When I say severe, I mean people, literally, could not work.

Now conceptually I understood indexes and the need for them, but having never really been accountable for the database on a nontrivial level I wasn't really up to speed. (ha.  get it?)  Now I am.

The primary purpose of indexes from our (the application developer) point of view is speed speed speed.  A well indexed database stands up to demand much better than one that is not so.  A well indexed database is performant.  So let's talk about them and what they do from a developer's standpoint.

There are two types of indexes:  Clustered and Non-Clustered.  Very simply put, a clustered index physically alters the table, and a non-clustered one does not.  Clustered indexes order the table based on the index key and control the speed at which the table is scanned, and as a result, control the speed at which all CRUD operations are performed.  Non-Clustered indexes duplicate data outside of the table structure, and can enhance (or reduce) the speed at which a seek is performed.

In the example of a Dictionary (the book not the data structure) think of the order of words within the book as a clustered index, with the words themselves being the key.  Now think of a reference book where there is a more arbitrary order to the data.  The index at the end of the book (or table of contents at the beginning) is a non-clustered index.  It repeats some of the data and allows for faster lookups.

Ok, with that out of the way, let's talk about practical uses.  Not wanting to start a fight or anything here, but, for all intents and purposes, any table of nontrivial size that will be modified (Insert/Update/Delete) should probably have a clustered index.  If not, it's a heap.  A heap isn't necessarily all bad, but it's not all that optimized at the end of the day either.  Now, rules for the clustered index key.  The key should be a)as unique as possible, 2) as "narrow" as possible, iii) as sequential as possible, and e) as static as possible.  Often it's a good idea to make the key some "non-important" piece of data, something that doesn't matter much to the user of the application.  Something that won't ever get changed.  Good candidates are an Identity column, or, in a pinch, a datetime (like a createdon field) that is ever-increasing.  These are great index keys because they match all 4 criteria.  Int/Bigint are certainly narrow datatypes, and datetime isn't that narrow but it's not all that wide either.  Identity fields are guaranteed unique.  A "createdon" type field isn't guaranteed unique, but it's unlikely to be super-duplicated either.  Both are ever-increasing and provide good sorts.  Both are not changeable (well, you *could* change them but typically you don't allow it).  These are great fields for your clustered index.

You may be wondering about GUIDs.  We like to use GUID primary keys because we can generate them in code when we object-map our data.  These are bad clustered index fields.  Yes they are unique, and yes they are static, but they fail the most important test:  non sequential.  Here's what happens when you go to insert a row into a table with a clustered index.  An index scan is performed.  A good place is found to insert the data.  Space, if it doesn't exist, is made.  This is the problem.  It leads to page splits and fragmentation.  A table with a clustered index on a GUID will tend toward 100% fragmentation, and it will do so very quickly.  It's not a perfect analogy, but imagine a 100% fragmented hard drive and what the performance would be like there.  You want to be in the sub 1% fragmentation range when possible.  If you don't have a good field that's a target for a clustered key, make one!  It's not a big deal.  I've done that with tables after the fact, and just added an identity column that nobody but the index cares about, and it does wonders.  Think back to the dictionary example.  Using the word as the clustered index key meets the uniqueness rule, and it meets the static rule, but it doesn't meet the sequential rule (it seems like it does, but it doesn't).  When you have your dictionary all filled in, and then discover that you need to insert a word between two existing words, imagine the work required to shuffle the remaining words down one.  SQL doesn't do this, it just makes a new page and hopes you fill it in later, which, really, you won't.

Ok, non-clustered indexes.  The purpose of these from our perspective is basically to cover your major query usage.  You probably know most of your high usage queries, so start there.  Major targets are foreign key lookups and search fields.  A non-clustered index, like I said above, duplicates the data in the table within the leaf node of the index.  In this world of virtually unlimited storage, that's probably something you don't need to care too much about.  Just don't go overboard.  Last week I had a major perf problem on the database, and just by defragging my clustered indexes, and adding covering non-clustered indexes to the major FK lookup areas and object load areas I am now receiving reports of the application being "faster than ever".

Now before you go just adding indexes willy nilly, take a look at the combination of SQL Profiler and Database Tuning Advisor.  These tools are a can't-live-without pair in your arsenal, so get used to using them.  They're pretty self explanatory, but check out this story for a brief intro on the two together.  Capture a few minutes of SQL activity at a time, run it through the DTA, and see what happens.  Then do the following:  Go to management studio, SET STATISTICS IO ON and SET STATISTICS TIME ON, turn on actual query execution plans, and run your qeuery.  Then create your index.  Then copy and paste your query again, and modify the first one with the WITH (0) hint (if you previously had no covering index) to force one query to execute without the new index.  Run them together and view the stats and execution plans side by side.  You will see whether or not you made a good index choice, and whether or not you have improved things.  (this is also a good technique for refining queries.  Run the original with your refinements side by side and you can compare the execution plans to see if you're heading in the right direction).  The beauty of a non-clustered index is that it doesn't affect your table physically so you can mess with them all day long trying new things and dropping them if they aren't good without worrying about your users (please be careful playing with live data omg).

You also need to go forward with an index maintenance plan.  Tables that receive a lot of inserts are going to become fragmented and less performant over time. You need to plan to rebuild these indexes regularly.  SQL 2005 lets you do this online without doing any exclusive locking (providing certain criteria are met, like no LOB data in the row) but it's probably still better to plan this for downtime or at least a time of low usage.  Creating and maintaining indexes will go a long way toward making sure your database applications are highly available and highly performant.

So, there's your very basic discussion of indexes for developers.  I recommend going much deeper with two of Kimberly Tripp's webcasts from her MSDN webcast series "A Primer to Proper SQL Server Development".  Check out parts 4 and 5 for invaluable information on this topic.  Even if you work with DBAs, this is stuff you should know, because if your apps are suffering performance hits on the database level, ultimately, that's your business.  Arm yourself with the knowledge to defend your domain.

Technorati tags: