Intro

I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students, some are my responses to posts on the Linkedin forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to: geverest@umn.edu since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to do that.

2017-12-22

Natural or surrogate keys – selection criteria


In a post on the Data Modeling forum in LinkedIn (2012/05), Bryan had 2 criteria for picking a key, Tony added a third and Kim had 4. Let me give you my nine which I put forth in my advanced database design class at the University of Minnesota (see http://geverest.umn.edu):

CRITERIA: (conflicts necessitate compromise)
* Unique ‑ Guaranteed!
* Ubiquitous ‑ have a value for every entity instance, ever. In the US the Social Security number was established in 1935 but we are now running out.  As another note, the SSA had no mandate to ensure that it was unique until many decades later ‑‑ not bad with a 75 year run!
* Unlimited ‑ won't ever run out of values; reuse?
* Unchanging, never changes ‑ "immutable"
* Under your own control ‑ manage the values/codes, or at least some independent body exists to assign and maintain codes to unsure uniqueness. E.g., the Universal Postal Union (http://www.upu.int) maintains the official names of all countries in the world. New countries must apply for and have approved a name, or a change from an old name or else they will be unable to exchange mail with other countries.
* Used by people in the user environment, if not, can you come up with one that satisfies the next criterion?
* 'Mnemonic' (easy for people to recognize/remember/generate)
* "Dataless" ‑ sole purpose is to identify ‑ should carry no other information; don't overwork an identifier
* Compact, easy and efficient to process and store ‑ often means 'numeric' and 'fixed length' (secondary)
> else, invent a surrogate (“Autonumber”) ‑ only if none exists naturally
Anyone have a 'U' word for any of the last three???

These criteria will usually get you close. If you can't get close, then take the 'else.'  It is important to distinguish the need for an identifier for users as expressed in a business oriented data model, and the model to be implemented in some DBMS. These two purposes are somewhat at odds with each other.  If you can't come up with a suitable natural key (according to the above criteria, and some compromise is usually necessary) then use a surrogate key. Then some sort of lookup would be needed for the users to find the surrogate key based upon the best information they have when doing a search. For almost any sort of natural key, you can nearly always find examples which make things difficult. e.g., we run out of numbers, a state changes the name of a county (so any number ID assigned based upon alphabetical order becomes invalid), we change characteristics on which the ID was assigned (e.g., should color be in the unique ID for an appliance? It usually is but I can paint my refrigerator ‑‑ color is an attribute, and should not be part of the ID).

One more comment. When we speak of having an identifier, it will only be unique within a defined and finite name space, that is, a particular context. If we go out of that name space we need to qualify the ID with an additional identifier of the expanded context, ad infinitum. 

7 comments:

  1. Alexander Titov writes (excerpted): @Gordon, I am really impressed with your list... But a few comments: the fact that Universal Postal Union (UPU) maintains a list of official names of all countries in the world is not really useful if there is a country which does not care at all about UPU... In reality, new countries are not obliged to apply to UPU if they don't want... Your point that "they must..." is far too strong. For example, being a British (I mean citizenship), I am not obliged to apply (and get) a British passport... And I like your comment that uniqueness is defined within a context, which nicely fit to my understanding expressed by quotations in the beginning of my post.

    ReplyDelete
    Replies
    1. EVEREST RESPONDS: 2012/5/23

      Alexander, thanks for your kind words. Regarding the UPU, you are absolutely correct, that a country is not obliged to apply. However, if they don't they will remain isolated to some extent. My understanding is that if a country does not apply for and receive approval of their NAME and agree to the terms of the UPU, then other countries are not obligated to accept mail from that country, or to deliver mail to that country. The primary purpose of the UPU is to establish a framework for the free flow of mail between countries. That means for countries to agree to deliver each other's mail. So when I mail a letter in the USA intended for someone in France, the USA gets all the revenue and France agrees to deliver the letter. Similarly, when someone in France sends a letter to the USA, France gets all the postage and the USA agrees to deliver it. Does anyone have a different understanding? The reality, of course, is that if we send a letter to "Germany," it will probably get delivered even if the official name of the country is different. You are also correct about the passport. However, without one you may have a little difficulty with international travel. As a practical matter, if you want to travel you will likely want to get a passport.
      ... Let me emphasize the bigger point I am making. IF there exists some standard for naming, encoding, etc., that is a compelling reason to use it. At least it should be a considered option.

      Delete
  2. RESPONSE FROM Martijn Evers
    * 'Mnemonic' (easy for people to recognize/remember/generate)
    * "Dataless" sole purpose is to identify should carry no other information; don't overwork an identifier
    * Compact, easy and efficient to process and store often means 'numeric' and 'fixed length' (secondary)

    As alternatives:

    Understandable (recognizable, predictable, memorable)
    Undividable (identifier should not house internal information so that in can be broken up)

    For me "Compact" belongs to a set of implementation characteristics which should be seen separate from the logical criteria. (You could separate the criteria after e.g., Zachman's Framework in several layers in which Compact is in the physical layer).

    I would also add verifiable. Possible to verify the validity of the identifier. Either an internal consistency check (a parity number) or an external check. This usually conflicts with other desirable properties of identifiers.

    I will remark here that the criteria discussed here are from a purely "greenfield" design perspective. Designing/defining Identifiers in brownfield situations (e.g. Data warehousing) you have more criteria to content with like visibility, internal integrity etc.

    Also, I have some reservation against discussing "natural", "surrogate", "whatever" keys. These are just names for classes of keys that meet certain criteria (actually, properties of the keys having certain values). Classifying types of keys is an taxonomic exercise, while assessing their properties (criteria) is an ontological exercise. There is no guarantee that the very simple (binary) key taxonomy (natural vs surrogate) we default to will be mapable to our key ontology when your ontology is somewhat more complex (and like Gordon shows, it usually is). Raging online discussions on natural vs surrogate indicate to me that the current default key taxonomy is not only useless but actually quite dangerous.

    ReplyDelete
    Replies
    1. Martijn, thanks for your suggestions for renaming those criteria.
      I agree that "compact" is more physical implementation. However, it is often used as a criteria in the selection of identifiers. It is minor and should not take precedence or outweigh any of the other selection criteria.

      Delete
  3. Bryan Watson posts · I have been describing the elements of a natural key, not just any "key" (not primary key, not surrogate key, not any database key, no any database anything).
    When I am working in the realm of data models and data architecture, I use terms that are meaningful to the users and to the data but I don't assume that either a database or a computer is or will be present. Hence, DBMS specific notions like primary key and surrogate key are not part of the parlance. Natural key, however, is.
    I believe that Gordon confuses these two the key as referenced by computers and the key as referenced by people. Computers are quite efficient at dealing with encoded, assigned identifiers the particular universe of surrogate keys and less efficient with identifiers used in everyday language. People are just the reverse in fact, people make do with anonymous identifiers (pronouns and articles, like "him", "the room", "you" and the like). I'm not proposing that we use pronouns for natural keys that becomes impossible but naturally occurring property values are appropriate.
    So I associate natural keys with people's common language for that reason, memorability becomes essential. Uniqueness would be ideal, but is unrealistic. Universality, efficiency and most of the other properties that Gordon cites are good for computers, good for programmers, but not good (or not necessary or not relevant) for people. In short, unnatural.
    This is all based on my assumption that data modeling is about people and data, and not about computers, databases, programming or other automation forms. There is a model for that computer space database models, logical models, physical models are examples but the data model does not presume that a computer is present.
    Perhaps that is not a shared assumption.

    ReplyDelete
    Replies
    1. EVEREST RESPONDS: 2012/5/23
      .. To Bryan: Wow, let me clarify. The term "natural" is perhaps too ambiguous. It depends on where you are coming from, the context, and your purpose. My focus is always on modeling the users' world with no consideration for computerization, programming, or anything else relating to implementation, or stored representation. In fact, it is important that these considerations be explicitly excluded from the early stages of modeling.
      ... You say you want to use terms that are "meaningful... to the data" and that "data modeling is about data." I disagree. It is unfortunate that we even use the term "data model" which implies that it is a model of data. We are not modeling data at all; we are modeling the things of interest in the users' world and we represent that in data, it is not a data model but a MODEL IN DATA OF <...>.
      ... First, a model is an abstract representation of something abstract because it leaves stuff out (reality is infinitely complex), and a re-presentation of that reality. The reality presents itself to us (or the modeler), we capture our perceptions and cast them into a model, and present the model to others (business users, programmers, and database implementers). Building models is driven/directed by some chosen modeling scheme, which specifies the constructs and expression of constraints used in building models. For example, ER, Relational, ORM, IDEF1X... are modeling schemes (usually with their own rules for building models and notational conventions).
      ... What is real in the world is that there are individual things concrete, abstract, etc. As modelers we group things into populations (types, classes) based upon commonality of characteristics. Forming populations is ultimately something the modeler "imposes" on the world. At this point we need a lexical vocabulary to talk about (represent) types of things Employee, Department, ... and a definition of the populations. So far we have said nothing about identification or keys. We have included populations of things, relationships among those things, and constraints on those populations and relationships.
      ... Now within a population we must have some way to distinguish members of the population. Since it is hard to put people or products into our computer storage devices, we must devise some lexical surrogates (data) which can be a handle for the members of the population. I don't care whether it is natural or what, we still need one. If the users already have one in their environment, we should consider using it. If they use identifiers which are imperfect (according to my nine rules) we may want to come up with something better. Bottom line is that we must come up with some way to uniquely reference or identify individual instances of a population, if we are to build a model (and ultimately a database, but that is also a choice) to represent the users' world. All of this is before any consideration of computers or implementation.
      ... All of my criteria for selecting unique identifiers (sometimes called keys) are strictly from the standpoint of the user's world. They put forth the ideal, which is seldom completely achievable, but we need to try. If there is one in use and it comes close to the ideal, then use it. If there is none, then devise one according to the criteria.

      Delete
    2. EVEREST continues:
      ... Let me give you an example of a devised naming (encoding) scheme. Minnesota has 87 counties (currently). The full name is unique (by design and controlled by the state government) but cumbersome, variable length, and inefficient. One scheme used was to assign sequential numbers to the alphabetical list of counties. Is that memorable, or rather mnemonic? That would work... until the state renamed a county, split or combined counties, etc. Is that possible? Now, upon closer examination, if you take the first four letters (prefix) of each of the counties, it turns out that they would be unique except for one pair "Lake" county, and "Lake of the Woods." Furthermore, people refer to the latter by the acronym LOTW. Voila, a memorable, and efficient coding scheme, and it even retains the alphabetical sequence of the county names. This scheme is pretty good for people (and computers) and satisfies many of the criteria for choosing or designing "keys." Is that a "natural" key? Was it designed by computer types yes. I suggest that my criteria for picking or designing identifiers or "keys" still holds, and is not presuming implementation, as Bryan suggests.

      Delete

Comments to any post are always welcome. I thrive on challenges and it will be more interesting for you.