The Author Online Book Forums are Moving

The Author Online Book Forums will soon redirect to Manning's liveBook and liveVideo. All book forum content will migrate to liveBook's discussion forum and all video forum content will migrate to liveVideo. Log in to liveBook or liveVideo with your Manning credentials to join the discussion!

Thank you for your engagement in the AoF over the years! We look forward to offering you a more enhanced forum experience.

The first chapter of this book summarizes and matches about fifteen years of my own experience and does it in such a short space that I was deeply impressed. I've developed a couple of persistence layers in my career, Hibernate is a breath of fresh air. The first chapter covers so much ground It's worthy of a full textbook in it's own right. So when something later on doesn't match my own experience, then I deeply question my own beliefs. Which is the origin of this post.

Chapter 3, Section 3.4.3 "Choosing primary keys"

Experience has shown that natural keys almost always cause problems in the long run. ... we strongly recommend that new application use synthetic identifiers (also called surrogate keys).

I'm currently in an interesting job situation. I get to be a in charge of a re-write team of a medical database. It currently has around 300 tables, and 300kloc of associated code. The idea of surrogate keys was applied to the existing database to every table without exception.

Needless to say, the problems of natural keys don't exist in the current code. However, surrogate keys create problems of their own. The primary problems are ad-hoc queries and merging data streams. Large numbers of ad-hoc queries are done versus the data. These queries could be greatly simplyfied by using natural keys. Secondly merging two databases (which will start occuring regularly), is a nightmare. Everything from one database has to be renumbered so the two databases fit together, or another column needs adding to every table, kinda like the hi-lo algorithms.

The ad-hoc query thing has resulted in all ad-hoc queries coming back to the IS department. Even with tools, the statisticians have no understanding of surrogate keys. They started off trying to do the queries themselves, but now refuse unless we get rid of surrogate keys on the data of interest.

Also our natural keys don't change. Once data is created, it cannot be deleted or changed--especially the natural key. The whole record can be marked as an error. So the associated natural key problems don't exist. This is a factor of the business that comes about due to our management's interpretation of HIPAA.

I am currently recommending the following: In merge or ad-hoc areas, use natural keys when possible. Otherwise use surrogate keys. I might add, that some of the surrogate keys are inescapable, i.e. client number, employee id, etc. Fortunately the merge doesn't care about employee, only client data, and this is the only thing that will require renumbering if databases are merged, if natural keys are used on client data.

In some of the deeper nested trees, the natural key becomes ugly as it propogates further and further.

Am I right in such thinking? Is there an easier way to merge two databases with surrogate keys? The ad-hoc queries could potentially be supported through views.

Anyway, I'm not sold on going with strickly surrogate or natural keys, and obviously the author is of the opinion that surrogate keys is prefered. I need more evidence/debate to be swayed one way or the other.
Have you looked at Martin Fowler's book, "Analysis Patterns"? Specifically chapter 2. The whole pattern covers many different variations of this theme.

You'll need an intersect table. Unless this is a small project one usually discovers that more relationships are needed. The pattern Fowler uses actually ends up with 4 tables. The table of 'parties', i.e. users, clients, organizations. The table of 'accountabilities', i.e. relationships between parties, this is the intersect. Then a table of 'accountability types' is added, which covers all the different types of relationships that are possible. This table is an intersect of the 'party type' table which covers all possible types of parties.

With these 4 tables, and one interface that allows for editing relationships. We can knock out 6 or 7 different user interfaces in our in system with one code base, the only thing that changes is the 'accountability type' that's being edited. In fact it's so powerful in association, that we were requested not to use it's full power so that confidential data is kept compartmentalized.

However if it's a simple app 2 tables are all that is required. For example:

create table user
(
user_id varchar(10); // must be constrained unique
);

create table friend
(
user_id varchar(10); // Foreign key to above
friendly_with varchar(10); // Foreign Key to above
);

This should give you the basic idea. For the more general case see Fowler's book.

Shawn