shawn (2) [Avatar] Offline
#1
The first chapter of this book summarizes and matches about fifteen years of my own experience and does it in such a short space that I was deeply impressed. I've developed a couple of persistence layers in my career, Hibernate is a breath of fresh air. The first chapter covers so much ground It's worthy of a full textbook in it's own right. So when something later on doesn't match my own experience, then I deeply question my own beliefs. Which is the origin of this post.

Chapter 3, Section 3.4.3 "Choosing primary keys"

Experience has shown that natural keys almost always cause problems in the long run. ... we strongly recommend that new application use synthetic identifiers (also called surrogate keys).

I'm currently in an interesting job situation. I get to be a in charge of a re-write team of a medical database. It currently has around 300 tables, and 300kloc of associated code. The idea of surrogate keys was applied to the existing database to every table without exception.

Needless to say, the problems of natural keys don't exist in the current code. However, surrogate keys create problems of their own. The primary problems are ad-hoc queries and merging data streams. Large numbers of ad-hoc queries are done versus the data. These queries could be greatly simplyfied by using natural keys. Secondly merging two databases (which will start occuring regularly), is a nightmare. Everything from one database has to be renumbered so the two databases fit together, or another column needs adding to every table, kinda like the hi-lo algorithms.

The ad-hoc query thing has resulted in all ad-hoc queries coming back to the IS department. Even with tools, the statisticians have no understanding of surrogate keys. They started off trying to do the queries themselves, but now refuse unless we get rid of surrogate keys on the data of interest.

Also our natural keys don't change. Once data is created, it cannot be deleted or changed--especially the natural key. The whole record can be marked as an error. So the associated natural key problems don't exist. This is a factor of the business that comes about due to our management's interpretation of HIPAA.

I am currently recommending the following: In merge or ad-hoc areas, use natural keys when possible. Otherwise use surrogate keys. I might add, that some of the surrogate keys are inescapable, i.e. client number, employee id, etc. Fortunately the merge doesn't care about employee, only client data, and this is the only thing that will require renumbering if databases are merged, if natural keys are used on client data.

In some of the deeper nested trees, the natural key becomes ugly as it propogates further and further.

Am I right in such thinking? Is there an easier way to merge two databases with surrogate keys? The ad-hoc queries could potentially be supported through views.

Anyway, I'm not sold on going with strickly surrogate or natural keys, and obviously the author is of the opinion that surrogate keys is prefered. I need more evidence/debate to be swayed one way or the other.