Intro

I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students; some are my responses to posts on the LinkedIn forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to geverest@umn.edu, since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to add you as an author.

2020-03-29

How a data model is expressed

Fabian Pascal posts (LinkedIn, Data Modeling, 2020/03/29)
Shown a data model diagram, he says "Actually, it is not a logical model, but a graphical representation of it that users understand -- the DBMS doesn't. Users understand the conceptual model SEMANTICALLY, the DBMS "understands" the true logical model ALGORITHMICALLY and that's not what your drawing is."

Everest responds:

Regarding model (re)presentation, the exact same model can be presented in a variety of ways - as a graphical diagram, or in a formal linear notation (for machine processing; I take it this is what you call "logical"). But I would hope and expect that the underlying semantics would be exactly the same. We build a data model, and it should not matter how it is expressed, as long as the expressions are equivalent. The semantics relate to the model, not to how it is expressed (Conceptual?). Semantics is the meaning behind what is presented. Furthermore, all models are logical, that is, built according to some set of rules of logic and formal argument, no matter how they are presented or who reads them (check your dictionary). The rules in this case constitute what I call the modeling scheme.

Picking the right users for a data modeling project

Ken Evans said (LinkedIn, Data Modeling, 2020/03/29)
The piece of the puzzle that you have not mentioned is where the users' understanding of their domain is rather vague.

Everest responds:

If you have users with a vague understanding of their domain, you are talking to the wrong people. I have found that there are people in the user domain who really do know what is going on, what their world looks like. They are often seen as troublemakers, asking the tough questions, complaining about how things are done (or not done) and suggesting how things could be done better. They are people on the front lines, working in the trenches, who actually think about what they are doing, are not satisfied with the status quo, and go beyond their job description. Every organization has such people; you just need to find them. And the best way to find them is to ask other users. Most can readily tell you who they are. If there is no one they can point to, you have a dead or dying organization where nobody cares. The people who fit the profile above will generally not be management or senior level -- who are usually tending to managing and training people, as they should be. Once you get the right people to the table it takes a skilled facilitator to elicit the needed information and document it in a usable form, i.e. a data model, such that they understand and concur with the representation.

Who judges the accuracy of a data model?

Kevin Feeney says (LinkedIn, Data Modeling, 2020/03/29)
How do we know that we have a correct model? My general take would be that we know the model is correct, if the world moves and the model moves with it without breaking anything - if the model is wrong, you'll find duplication and other errors seep in. In terms of how you set the scope for the model and how you deal with systems integration and different views on data - generally you need to go upwards in abstraction and make the model correct at the top level. For example, when you find that two departments have different concepts of what a customer is, both of which are correct from their points of view, there are implicit subclasses (special types of customer) that can be made explicit to make everything correct.

Everest responds:

Kevin, I see two main problems with your viewpoint. (1) It sounds like you are saying design it, build it, and wait for problems to arise. That would be irresponsible and dangerous. Surely we need to judge correctness before we commit resources to implementation. Before implementing the model and building a database, we need to have some assurance that our model is an accurate representation of the user domain. (2) It sounds like you are depending on the designers/modelers to make judgments about model correctness. That is the last thing I would do. Too often I have found that the data modeling experts had only a superficial understanding of the user domain. They may be well versed in the modeling task, but that doesn't produce a good model. The best modeling tool in the world and the best modeling methodology would be insufficient to produce a "correct" data model. Rather than a "correct" model I prefer to call it striving for an "accurate" data model, that is, one that accurately represents the user domain. As Simsion argued and I agree, there is no single correct model. So, who best to judge?

So, who best to judge the "correctness" of a data model? I say, the USERS THEMSELVES. They are the ones who understand their world better than anyone else. But you have to get the right users to the table. I have led dozens of data modeling projects and we only go to implementation when ALL the user representatives sign off and say "Yes, this is an accurate representation of our world." If there are differences, they must be resolved among themselves (with wise direction from a trained data modeler). One caveat: the users must thoroughly understand the data model, in all its glorious detail (not high-level). It is the responsibility of the data modeler to ensure the users collectively understand all the details of the model -- an awesome responsibility. That means the users must understand the model diagrams and all the supporting documentation, particularly the definitions of the "things" (entities, objects), the relationships (binary, ternary, and more), and all the associated constraints (e.g., cardinalities). Our goal is to develop as rich a representation as possible of the semantics of the user domain, and that means having a rich set of constructs to use in developing the model. So far, I see ORM as the richest modeling scheme.

The best way to make this happen is for the user representatives to be part of the modeling team. In fact, they should be the ones in control. Upper management needs to grant release time to those users most knowledgeable about their domain. An experienced data modeler needs to facilitate and guide the modeling process and the development of the data model. The team needs to be allowed to meet and deliberate as long as necessary to arrive at a model which they all feel comfortable approving. In my experience the users have always known when they were done (and ready to go to implementation), although the time it took was difficult to predict up front. Only in one project were we unable to come to agreement and that is because we had the wrong user representatives at the table. They were little more than data entry clerks who really didn't understand the meaning of the data, why it was important, nor how it was used.

2020-03-26

Determining the "Correctness" of a data model

LinkedIn, Data modeling, 2020/3/26
I asked: How do we know when a data model is correct?
Ken Evans responded: That's easy, when the model conforms to stated requirements.
I then asked "who determines/documents the requirements?"
Ken responded:
It does not matter "who" determines the requirements. The point is that you can only judge whether a deliverable is "correct" if you have a set of pre-established requirements against which to assess "correctness". This principle has been widely accepted in the quality management discipline since 1978 when Philip Crosby published his book "Quality is Free." Crosby makes the point that "Quality is conformance to requirements."

Everest Responds:
Sorry about the "who", perhaps I should have said "how." While I accept the general principle, is it realistic? Crosby's statement begs the question, if quality is conformance to requirements, then who/how do we determine the correctness and completeness of the stated requirements? I have yet to see anything close to an a priori statement of requirements that was sufficient to judge the correctness of the end result, i.e., a domain data model. Furthermore, I have yet to see any guidelines sufficient for preparing a statement of requirements for a data modeling project. I would love to see any examples.
.. To me, the only satisfactory "statement of requirements" sufficient to judge the correctness of the model would be the final, detailed data model itself. Anything less than that would not be sufficient to express the full set of semantics to be included in the final model. In the case of ORM perhaps, a complete set of elementary fact sentences, with well-defined object types and predicates. But that is what the entire process of data modeling is all about -- to discover and document the semantics of some user domain. We want to capture as rich a set of semantics as possible given our modeling scheme (which is why we need to use a modeling scheme such as ORM, which captures many more semantics than any other scheme, including ER/relational).
.. So the question remains, whether we are talking about the data model, or the requirements for a data model -- how do we judge the correctness of a data model? Who is in the best position to do that?

2020-03-25

Is data modeling Design or Description?


Ken Evans asks (LinkedIn, Data Modeling, 2020/3/25)

In the book "Data Modeling: Theory and Practice" describes the result of his extensive research into the "Design or Description?" question. Graeme's research showed that whilst many people believe that a data modeler's job is to "describe" a reality that is out there, the truth is that data modelers are designers rather than describers.

Everest Responds:

I do not intend to contradict what Simsion said.  If modeling were purely a descriptive activity it would be easy, although we would have differing points of view, different interpretations.  It is the differences in these descriptions that make it best considered a "design" activity -- precisely because there are many choices to be made in building the model.  The modeler CHOOSES how to represent that domain in building the model.  So the question remains, how do we know when we have a correct model?  Interesting that Simsion maintains that there is no one correct model, hence a design activity.  I agree with that, which is why I questioned whether the goal of data modeling is to find THE correct model.
Or perhaps one would be open to finding A correct model! However, that still doesn't answer the question: How do we know it is A correct model?

Is a data model a "representation"? ...of what?


Ken Evans asks (LinkedIn, Data modeling, 2020/3/25)
Hmm. Is a data model a "representation"?

EVEREST RESPONDS:

First of all I don't like the phrase "Data Model" (sorry "Ted" Codd).  It suggests that it is a "model of data."  That is misleading to someone outside of our community.  It is only a model of data if we have some data.  Then the model would be an (abstract) "representation" of the data.  For us, a "data" model is a model of some aspects of a domain of interest to a community of users, real (world) or imagined/desired/yet to be built.  It is a model "in data," that is, built using informational constructs, all guided by a modeling scheme.  The modeling scheme tells us what to look for in the domain and how to represent it in the model.  So we identify a population of similar things, give it a label and a definition, and put a box or circle into a diagram to represent that population of things.  We build up or "design" a model with lots of types of things, add relationships among those things, and constraints on those things and relationships.  The modeling scheme tells us how to represent those relationships and constraints in our model.  Graham Witt adds some light to this argument by calling it a "business data model."
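To make this concrete, here is a toy sketch in Python (the Customer/Order example is my own invention, not from the discussion above) of what it means to build a model "in data": populations of things with labels and definitions, plus relationships carrying constraints.

```python
# Populations of similar things, each given a label and a definition.
entity_types = {
    "Customer": "A person or organization that has placed at least one order.",
    "Order":    "A request by a Customer for goods, accepted by the business.",
}

# Relationships among those things, with constraints (here, a cardinality).
relationships = [
    {"name": "places", "from": "Customer", "to": "Order", "cardinality": "1:M"},
]

print(entity_types["Order"], "--", relationships[0]["cardinality"])
```

The modeling scheme is what tells us that these are the kinds of constructs to look for and how to record them; the diagram is just one rendering of the same content.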

2018-05-15

Data is like the corn flakes

In our information systems and IT shops, computers, communication networks, programs, systems, programmers, indeed, all IT/IS personnel, are merely "factors of production." We don't "eat" the factors of production of corn flakes - the bins which store the corn and other ingredients, machines which roll out the corn, which mix the ingredients, and fill the boxes, and the belts and pipes that move the product in the warehouse, and even the warehouse. No, we eat the corn flakes, just like we "eat" the data. It is the output from our computer information systems which is of value.

2018-05-14

Data must be managed as an asset

by Martha Lemoine, on the Data-centric manifesto.

Everest responds:

Yes, we often say this, but what does it really mean? Look at the definition of Asset in accounting. Organizations spend money. We call these expenditures. The general assumption is that an expenditure benefits the current accounting period (year) because it is basically helping to generate revenue for the same period in which it is spent. So we then call it an expense. However, if it is a relatively large expenditure which benefits future years, say a new building or piece of machinery, then accountants want to capitalize the expenditure and spread it as an expense (depreciation) over multiple years, offsetting the future revenue it helped to generate. A capitalized expenditure is called an asset. Capitalizing an expenditure pushes an expense into the future, thus increasing reported income for the current accounting period or year.
.. Now in the case of data, does it benefit future time periods, i.e., help to generate revenue in the future? Of course. So what expenditures are made to collect, store, process, retrieve, maintain, and update it? Should they be capitalized? That is the hard part. Nobody does, except maybe a firm whose business is selling data, but most of the time even they don't. So we must be careful calling data an "asset." Nevertheless, we can certainly call it "something of value" which, as you say, must be managed as an asset.
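As a rough illustration of the capitalization point, here is a minimal sketch of the arithmetic; the figures and the five-year useful life are made-up assumptions, not anything from the discussion.

```python
# Hypothetical numbers for illustration only.
expenditure = 100_000          # spent this year to build up a data resource
useful_life_years = 5
annual_revenue = 300_000       # assume level revenue each year

# Treat it as an expense: the whole cost hits this year's income.
income_if_expensed_year1 = annual_revenue - expenditure             # 200,000

# Capitalize it: spread the cost as straight-line depreciation.
annual_depreciation = expenditure / useful_life_years               # 20,000
income_if_capitalized_year1 = annual_revenue - annual_depreciation  # 280,000

print(income_if_expensed_year1, income_if_capitalized_year1)
```

The difference between the two year-1 figures is exactly the expense pushed into future periods.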

2018-03-12

Data Modeling and NoSQL


To the question “How relevant is data modeling in the world of NoSQL?” I give the following answer.
  The main purpose of data modeling is to understand the business, some application domain, some user world.  The model becomes a representation of that world -- the "things" in it, the relationships among those things and any constraints on those things or relationships.  A secondary purpose is to build a database to contain information which pertains to and describes that domain.  Hence, the business data model should not be concerned with issues of physical stored representation, or the transformations/manipulations/constraints which are imposed to facilitate implementation in some data (storage) management system.  That could be a relational DBMS, or a NoSQL tool.  Generally we speak of the model coming first, then the implementation, and finally, the data gets collected and stored according to the model.  However, increasingly the data already exists in some form.  Which leaves us with the task of figuring out what it means, what it represents -- that is, understanding the data as it represents some user domain.  NoSQL tools are often designed to deal with existing data and to process it more efficiently (that may be an oversimplification!).   Either way, you must understand the business in order to make sense of the data. 
BOTTOM LINE:  it is vitally important to understand the principles, methods, and approaches of data modeling which can help you get to an understanding of the data and the business domain it represents.  Whether using traditional database management tools or NoSQL tools you still need to understand the data.

2018-01-02

Data modeling course in a nutshell

A former student asks:

..    I took your class last spring and am currently in talks with some companies for a junior database position.  I find it a little challenging when it comes to talking about exactly what I learned in your class, as we covered a lot of valuable information. If you have any input for a quick way to describe all of the information I learned, that would be very helpful.

Concerning Data Model Madness: what are we talking about?

Martjin posts (2012/10/11)

..   There have been endless debates on how to name, identify, relate and transform the various kinds of Data Models we can, should, would, must have in the process of designing, developing and managing information systems (like Data warehouses). We talk about conceptual, logical and physical data models, usually in the context of certain tools, platforms, frameworks or methodologies. A confusion of tongues is usually the end result. Recently David Hay has made an interesting video (Kinds of Data Models ‑‑ And How to Name them) in which he tries to resolve this issue in a consistent and complete manner. But on LinkedIn it was already questioned whether this is a final or universal way of looking at such models.
..   One of the caveats is that a 'model' needs operators as well as data containers. Only the relational model (and some Fact oriented modeling techniques, which support conceptual queries) defines operators in a consistent manner. In this respect Entity Relationship techniques are formally diagramming techniques. Disregarding this distinction for now we need to ask ourselves if there are universal rules for kinds of Data Models and their place within a model generation/transformation strategy.
..   I note that if you use e.g. FCO‑IM diagrams you can go directly from conceptual to "physical" DBMS schema if you want to, even special ones like Data Vault or Dimensional Modeling. I also want to remark that there are 'formal language' modeling techniques like Gellish that defy David's classification scheme. They are both ontological and fact driven and could in theory go from ontological to physical in one step without a conceptual or logical layer (while still being consistent and complete, btw, so no special assumptions except a transformation strategy). The question arises how many data model layers we want or need, and what each layer should solve for us. There is tension between minimizing the number of layers while at the same time not overloading a layer/model with too many semantics and constructs, which hampers its usability.
..   For me this is governed by the concept of concerns, pieces of interest that we try to describe with a specific kind of data model. These can be linguistic concerns like verbalization and translation, semantic concerns like identification, definition and ontology, data quality constraints like uniqueness, or implementation and optimization concerns like physical model transformation (e.g. Data Vault, Dimensional Modeling), partitioning and indexing. Modeling/diagramming techniques and methods usually have primary concerns that they want to solve in a consistent way, and secondary concerns that they can model but do not deem important (though these are usually important concerns at another level/layer!). What makes this even more difficult is that within certain kinds of data models there is also tension between notation, approach and theory (the N.A.T. principle). E.g. the relational model is theoretically sound, but the formal notation of nested sets isn't visually helpful. ER diagramming looks good but there is little theoretical foundation beneath it.
..   I personally think we should try to rationalize the use of data model layers, driven by concerns, instead of standardizing on a basic 3 level approach of conceptual, logical, and physical. We should be explicit on the concerns we want to tackle in each layer instead of using generic suggestive layer names.
I would propose the following (data) model layers minimization rule:

A layered (data) modeling scenario supports the concept of separation of concerns (as defined by Dijkstra) in a suitable way with a minimum of layers using a minimum of modeling methodologies and notations.

Physical data model governed by business (logical) data model

Jaunine Conradie (student) asks: (email to me on 2012 Aug 13)
    The typical Teradata architecture includes a staging layer (stages data from sources in the form that they exist on the source), an integration layer (3NF integrated data from stage), and a semantic layer (dimensional view on the integration layer data).
.. We are two modelers on the project. The problem that we have is that the ETL lead is demanding a model change to facilitate data loads and queries.
.. In short, a Subscriber on the network (Vodacom here in South Africa, which is like Sprint or Verizon in the USA) has various Access Methods of various Access Method Types.  Types can, for example, be Primary (cell phone number), Data (for internet access), etc.
The change being requested by the ETL lead is to host the Primary access method on Subscriber.

This is problematic because:
1. This is inconsistent with the implementation, where some access methods for a subscriber will now be carried on the Access Method Subscriber entity, and some on the Subscriber entity itself.
2. We do not (and should not) carry history on the Subscriber entity.
3. Teradata has specialized index types to facilitate joins (thus negating the argument for denormalizing for query performance).


How do we convince the program manager to not let ETL‑related considerations dictate the design of the physical model?
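For what it's worth, here is a rough sketch of the two alternatives being argued over. sqlite3 is used purely as a stand-in, and the table and column names are my own guesses rather than the actual Teradata model.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Alternative 1: the modelers' design -- every access method, Primary included,
# lives on the associative entity, which also carries history.
con.executescript("""
CREATE TABLE subscriber (
    subscriber_id INTEGER PRIMARY KEY
);
CREATE TABLE access_method_type (
    access_method_type_cd TEXT PRIMARY KEY        -- e.g. 'PRIMARY', 'DATA'
);
CREATE TABLE access_method_subscriber (
    subscriber_id         INTEGER REFERENCES subscriber,
    access_method_type_cd TEXT    REFERENCES access_method_type,
    access_method_value   TEXT,                   -- e.g. the cell phone number
    effective_from        TEXT,                   -- history carried here
    PRIMARY KEY (subscriber_id, access_method_type_cd, effective_from)
);
""")

# Alternative 2 (the ETL lead's request): denormalize the Primary access
# method onto Subscriber itself -- some access methods on one entity, some on
# the other, and no room for history on Subscriber.
con.executescript("""
CREATE TABLE subscriber_denormalized (
    subscriber_id         INTEGER PRIMARY KEY,
    primary_access_method TEXT
);
""")
```

Seen side by side, the inconsistency the modelers object to is visible: the same kind of fact ends up in two different places depending on its type.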

Definitions of concepts should start with their supertype. Why not?

Andries van Renssen -posts on LinkedIn. 
A good definition of a concept should build on the definition of more generalized concepts. This can be done with a definition in two parts:
1. Specifying that the concept is a subtype of some other more generalized concept.
2. Specifying in what respect the concept deviates from other subtypes of the same supertype concept.

Such definitions have great advantages, such as:
‑ The definition of (and the knowledge about) the supertype concept does not need to be repeated, as it is also applicable ("by inheritance") for the defined (subtype) concept.
‑ The definition implies a taxonomy structure that helps to find related concepts that are also subtypes of the same supertype (the 'sister‑concepts'). Comparison with the 'sister‑concepts' greatly helps to refine the definitions. Furthermore, the definitions prepare for creating an explicit taxonomy.
‑ Finding the criteria that distinguish the concept from its 'sister concepts' prepares for explicit modeling of the definitions, which prepares for computerized interpretation and the growth towards a common Formal Language.
--Thus, why not?
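As a small illustration of the two-part definition pattern (supertype plus distinguishing criterion), here is a sketch in Python classes; the Vehicle/Bicycle/Car concepts are invented for the example and are not from Andries's post.

```python
class Vehicle:
    """A device designed to transport people or goods."""

class Bicycle(Vehicle):
    """A Vehicle (supertype) distinguished from its sister subtypes
    by being propelled by its rider and having two wheels."""

class Car(Vehicle):
    """A Vehicle (supertype) distinguished from its sister subtypes
    by being propelled by an engine and having four wheels."""

# The class hierarchy makes the implied taxonomy explicit, and the
# distinguishing criteria in each docstring are what separate the
# 'sister concepts' Bicycle and Car from each other.
```

Everything said about Vehicle is inherited, so only the distinguishing criterion has to be stated for each subtype.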

Can we argue more about business rules?


Posted by Kamran Ayub - Thursday, May 6, 2010, 06:24 PM

I think it's interesting; it's too bad we waited so long to argue about it.  Let me say first off: I understand both arguments. I have my own opinions, but I can definitely see both sides to this.  First, I think we should establish some FACTS that we can all agree on:
1. No DBMS currently available (at least for production-oriented apps) can completely express every business rule we need to express innately. (Src: experience and Everest)
2. No matter which way you slice it, business rules eventually get turned into code somewhere, whether it's in the DBMS or in application logic. At some point, there's a piece of machine language where the rule is expressed and validated in 1s and 0s.


There, now that we agree on those two points, what exactly are we arguing about?  If you believe Fact 1, then it follows that it is not possible to ever express all our business rules in the DBMS (at this current state in time).
Here's my view:  All DBMSs do enforce a limited set of business rules. Thus, if you argue that your application code holds all business rules in one central place, you are wrong, because the DBMS enforces a few of those constraints (mandatoriness/uniqueness/etc.). Most of us end up duplicating these rules to avoid catching exceptions, without thinking about it.  Secondly, you cannot guarantee that there is only a single point of entry to your data. Don't argue with me; there is a sysadmin account somewhere on the system where the data is stored, and at least that account has full authority to change the data. How many of us go directly into our SQL databases to fix a data error, input sample data, or test triggers? Exactly. Besides, let's not forget you had to define that structure somehow! It may be true your business users can't access that data directly, but somebody is the data steward and she has full control, rules or no rules. By the way, what about unauthorized access? Whoops, guess that hacker didn't care about all your application-enforced business rules! It's fine if you argue that the hacker could completely destroy your system, but you must agree that if your logic is stored in the application, then at some point, due to the nature of TCP/IP, your data is out in the open between your app and your database. That could potentially be tampered with, and your application can't do anything about it.
So now, the argument is not whether it's better to store business rules in application logic, the question is why can't we put those rules in the DBMS itself, innately? The answer to that is simply because no DBMS allows that just yet.  The reason I think people argue for storing rules in the application is simply because there's really no other better alternative right now. If a .NET developer is comfortable with .NET and not with T-SQL, where do you think she'll put her business rules? In the .NET application or some other application gateway (not the DBMS). This is what I heard today in our discussion. We put all our rules in application logic because it's the best central repository we have... when really the best place should be the DBMS. After all, what have we been learning all along in this class? ORM is able to capture many of these rules... and yet, not all. Business rules are there to establish integrity with our data so they should be in the DBMS.
I believe it's useless to argue for or against storing rules in application logic because right now it's impossible to put them all in the DBMS. Even if you wrote stored procs or triggers for everything, that's not enforced by the DBMS innately (triggers are closer to automatic), plus the rules are still in code.  Furthermore, I believe most people put their rules in application code because it's just harder to do it in the DBMS, and because many enterprise or even regular applications can target different storage systems (one application could use XML, MSSQL, MySQL, DB2, or MS Access as repositories). Imagine duplicating all your business rules across those systems... it's much easier to store the rules in the code, isn't it?  Let me be clear: I am not advocating that it is right to store business rules in app logic. I also believe that it is best to put business rules in the DBMS. All I am saying is, what choice do we have at the moment? Until a better DBMS is created, preferably one built using ORM as the tool to create the DB, we're SOL. Let the discussion begin!

EVEREST RESPONDS:

.. I love it.  I agree with all you have said. I just would not characterize it as "two sides of an argument." Both "sides" must spring from an environment of understanding of the basic concepts. There are two main traps we fall into: 

1. Believing that someday we can reach the ideal state of a perfect world.  Your discourse is filled with those thoughts, e.g., "no DBMS currently available..." (implying that one may be in the future), "at this current state in time" (suggesting that there may come a time), "no DBMS allows that just yet" (but perhaps a future DBMS product might?), "right now it is impossible" (with the implication that in the future it will be), "until a better DBMS is created" (better yes, but never perfect or complete).  "Man" (which includes woman) is imperfect, and therefore, so are his constructions, our DBMSs, our databases, our code, our systems. So try as we might to improve (and we should), we must always be alert to imperfection, weaknesses, failures, attacks, (whether inadvertent or deliberate, intentional) and build in countermeasures.  That is why we back up our data, build in redundancy, and why every government (every locus of power) must have "her majesty's loyal opposition" and the watchful eye of citizens and the media.

2. We tend to use, to favor that which is more familiar to us. If we are a programmer and that is what we know how to do, we will get things done by writing code in our favorite programming language. If we know how to use the full functionality of some DBMS, then that is what we will do as much as we can.  Some application coders need to be stretched to consider an alternative way of doing things that might get us closer to the ideal of serving people (management, organizations) AND incorporating appropriate safeguards (which also will never be perfect). 

.. The business rules and integrity constraints are those mechanisms that can lead to higher quality data, greater user confidence (in our databases, our systems, our IT departments, and our organizations), better decisions, and more effective and efficient operation. As you rightly state, ALL that happens in a computer system is done by executing code. But that is only half the story.  ALL that executing code is driven by specifications, definitions, declarations, input parameters, in other words "data" more broadly interpreted. Even the program code is the specification of a process. The accuracy and completeness of the process depends on the specifications, the directives, etc. 

.. As I show in my lecture on Business Rules (slides BRules.23 and .27), there is a hierarchy of levels where this information, the business rules, can be expressed:
1. DBMS
2. Stored Procedures, whose execution is under the control of the DBMS.
3. Code (methods) embedded (encapsulated) in "objects" which execute in a well-managed and controlled environment (OO, SOA, SaaS...)
4. User application code
5. External user actions (manual or sysadmin)
.. At all these levels the processes are driven by definitions, specifications.  The efficacy of those processes depends on those specifications, and we (humans) understand those processes through documentation which itself is usually quite imperfect.

.. Moving down these levels we encounter decreasing levels of control and increased dispersion. I put DBMS at the top, because that is the closest to the data and the most likely centralized point of control (least dispersion). Our objective is (or should be) to move these boundaries down so that work gets pushed up into more controlled environments. Another thing that happens as we move through these levels is increasing abstraction.  When we declare database integrity constraints, that is a specification (abstraction) of some underlying process (executed by code... somewhere). We (humans) do not need to know what goes on under the covers; we merely need to have confidence that it will be done correctly and consistently, as reflected in the definition or specification.  For example, we may understand the meaning of the square root operator, but how it gets executed on a computer is quite complex. I want that complexity hidden from me.  When we write a program we call all sorts of subroutines or functions, which are defined at a pretty high level (of abstraction).
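To illustrate the difference between level 1 and level 4, here is a minimal sketch of the same rule ("salary must be non-negative") declared to the DBMS versus re-coded in the application; sqlite3 is used only as a convenient stand-in for whatever DBMS is actually in play, and the table and rule are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Level 1: the rule is a declarative specification the DBMS enforces itself.
con.execute("""
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        salary NUMERIC NOT NULL CHECK (salary >= 0)
    )
""")

# Level 4: the same rule re-expressed as application code -- procedural,
# dispersed, and easy to bypass by any other path to the data.
def insert_employee(emp_id, salary):
    if salary < 0:
        raise ValueError("salary must be non-negative")
    con.execute("INSERT INTO employee VALUES (?, ?)", (emp_id, salary))

insert_employee(1, 50000)        # passes both checks

try:
    con.execute("INSERT INTO employee VALUES (2, -1)")   # bypasses the app...
except sqlite3.IntegrityError as e:
    print("DBMS still catches it:", e)   # ...but not the declared constraint
```

The point is not the particular tool: the declared constraint travels with the data, while the coded check travels with only one of possibly many applications.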

.. To the comment of "too bad we waited so long": The focus of my course is data modeling. The subject of a companion course would be DBMS and Database Administration. This issue we are discussing would properly belong in the second course.  However, I raise the question here because the success of maintaining the integrity of our databases according to our defined semantics depends upon the execution environment, the DBMS ++.

.. Let the discussion continue.

2017-12-22

Natural or surrogate keys – selection criteria


In a post on the Data Modeling forum in LinkedIn (2012/05), Bryan had 2 criteria for picking a key, Tony added a third and Kim had 4. Let me give you my nine which I put forth in my advanced database design class at the University of Minnesota (see http://geverest.umn.edu):

CRITERIA: (conflicts necessitate compromise)
* Unique ‑ Guaranteed!
* Ubiquitous ‑ have a value for every entity instance, ever. In the US the Social Security number was established in 1935 but we are now running out.  As another note, the SSA had no mandate to ensure that it was unique until many decades later ‑‑ not bad with a 75 year run!
* Unlimited ‑ won't ever run out of values; reuse?
* Unchanging, never changes ‑ "immutable"
* Under your own control ‑ manage the values/codes, or at least some independent body exists to assign and maintain codes to ensure uniqueness. E.g., the Universal Postal Union (http://www.upu.int) maintains the official names of all countries in the world. New countries must apply for and have approved a name, or a change from an old name, or else they will be unable to exchange mail with other countries.
* Used by people in the user environment, if not, can you come up with one that satisfies the next criterion?
* 'Mnemonic' (easy for people to recognize/remember/generate)
* "Dataless" ‑ sole purpose is to identify ‑ should carry no other information; don't overwork an identifier
* Compact, easy and efficient to process and store ‑ often means 'numeric' and 'fixed length' (secondary)
> else, invent a surrogate (“Autonumber”) ‑ only if none exists naturally
Anyone have a 'U' word for any of the last three???

These criteria will usually get you close. If you can't get close, then take the 'else.'  It is important to distinguish the need for an identifier for users as expressed in a business oriented data model, and the model to be implemented in some DBMS. These two purposes are somewhat at odds with each other.  If you can't come up with a suitable natural key (according to the above criteria, and some compromise is usually necessary) then use a surrogate key. Then some sort of lookup would be needed for the users to find the surrogate key based upon the best information they have when doing a search. For almost any sort of natural key, you can nearly always find examples which make things difficult. e.g., we run out of numbers, a state changes the name of a county (so any number ID assigned based upon alphabetical order becomes invalid), we change characteristics on which the ID was assigned (e.g., should color be in the unique ID for an appliance? It usually is but I can paint my refrigerator ‑‑ color is an attribute, and should not be part of the ID).
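Here is a tiny sketch of the 'else' branch: when no natural key satisfies the criteria, assign a surrogate and keep the natural, searchable attributes alongside it for lookup. The appliance example and names are invented for illustration.

```python
import itertools

# Surrogate generator -- guaranteed unique, unlimited, unchanging, dataless,
# compact... but meaningless to users, so they need a lookup on natural attributes.
next_id = itertools.count(1)

appliances = {}   # surrogate key -> descriptive attributes users search on

def register_appliance(maker, model, serial_no, color):
    key = next(next_id)                       # "Autonumber"-style surrogate
    appliances[key] = {"maker": maker, "model": model,
                       "serial_no": serial_no, "color": color}
    return key

k = register_appliance("Acme", "FR-200", "SN12345", "white")
appliances[k]["color"] = "red"    # paint the refrigerator: the key is unaffected
```

Because color is only an attribute and not part of the identifier, repainting the refrigerator changes nothing about how it is identified.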

One more comment. When we speak of having an identifier, it will only be unique within a defined and finite name space, that is, a particular context. If we go out of that name space we need to qualify the ID with an additional identifier of the expanded context, ad infinitum. 

TABLE THINK - multiplicity (manyness) in a relationship?

Suppose we have a relational table - [ X | A | B | ... ] where A and B are attributes of X.
Two common confusions:


(1)  First, we can say that there IS a relationship between X and A. Since we observe that for each X there is only one A (1NF), but for each A there could be many Xs, we may be led to say there is a one‑to‑many relationship between X and A.  However, that relationship is actually many X to one A, thus many to one, not one to many, as stated.  When you say one to many (between X and A), I must apply them respectively, otherwise, confusion results from the ambiguity and inconsistency.  So the relationship is Many to One between X and A, respectively.
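A small sketch of point (1), with invented data: each X row carries exactly one A (1NF), while the same A value can appear on many X rows, so the relationship reads many-X-to-one-A.

```python
# Each row is one X; A is an attribute of X (1NF: one A value per X).
table_X = [
    {"X": "x1", "A": "a1"},
    {"X": "x2", "A": "a1"},   # a1 appears for many Xs...
    {"X": "x3", "A": "a2"},
]

# ...but no X appears with more than one A:
xs_per_a = {}
for row in table_X:
    xs_per_a.setdefault(row["A"], set()).add(row["X"])

print(xs_per_a)   # e.g. {'a1': {'x1', 'x2'}, 'a2': {'x3'}} -- many X to one A
```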


(2) We observe that for each value of A there could be multiple different values of B in the table of X, and vice versa.  Does that mean there is a many‑to‑many relationship between A and B?  Any discussion of oneness or manyness must presume there is a relationship.  When you say there is a M:N relationship between A and B, then I must be able to list all the valid pairwise associations, as in a table with two columns.  And being M:N, the same value of A can appear in multiple rows and the same value of B can appear multiple times.  Of course the exact same values of A and B on a single row could never appear more than once since it would be the same instance of the relationship.  Now for any given X, there is a value for A and a value for B.  Would that particular value of A be the one that is related to that value of B?  NO.  Furthermore, could there be the same pair of A and B values appearing on multiple rows in X?  YES.  So the juxtaposition of A and B in X does not represent a relationship between A and B.  It is spurious.  IF there is a relationship between A and B, it must be represented in another table, and could be 1:1, 1:M, M:1 or M:N.  But since we said that we know everything about X, A, and B and it is all represented in the model, then there must not be a relationship between A and B.  Again, you can look at the table of X and see that there are multiple different values of A appearing beside the same value of B and vice versa.  However, that in no way means there is a relationship between A and B represented in the table of X.
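And a companion sketch of point (2), again with invented data: A and B values sitting on the same X row do not constitute an A-to-B relationship; any real A-B relationship would have to be asserted separately as its own table of valid pairs.

```python
table_X = [
    {"X": "x1", "A": "a1", "B": "b1"},
    {"X": "x2", "A": "a1", "B": "b2"},   # same A beside different Bs
    {"X": "x3", "A": "a2", "B": "b1"},   # same B beside different As
    {"X": "x4", "A": "a1", "B": "b1"},   # the same A,B pair can even recur
]

# If an A-B relationship really existed, it would be asserted separately,
# as its own table of valid pairs -- not inferred from juxtaposition in X:
a_b_relationship = [
    ("a1", "b1"),
    ("a2", "b2"),   # note: a2 sits beside b1 in X, yet is related to b2 here
]
```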

Physical Data Model vs. Implemented Database Schema ‑ To what degree should they match?

 By: Brent  (Linkedin Discussion: Started 2011 Nov. 9)
. . We're having an internal discussion about PDM standards. The existing standard says that, for all tables, columns, keys, indexes, and relationships, the PDM should match what was implemented in the physical database. We have a huge Teradata database that has no primary key constraints, foreign key constraints, or indexes implemented in the database, but they are still in the data model. Those components are part of the application code used to load the tables, so the total database implementation = database schema + data load rules, and the standard is satisfied. Agree? Disagree? What's your standard?

EVEREST RESPONDS:


First recognize that a data model goes through several stages from a detailed “conceptual” representation of the users’ world to final implementation in some DBMS.  This question suggests there are at least two stages.  Ideally, as the data model is transformed from one stage to the next, the (user) semantics are retained (perhaps with differing syntactical representations), but additional features and constructs may be added.  In general, the model at each stage will be different than the previous stage.  So I would answer that they need not and should not match, except in the semantics of the users’ world. 
Note that user semantics does not include things such as clustering data items into records/tables, converting the representation of relationships into foreign keys, adding indexes or surrogate keys, enforcing first normal form (other normal forms are still relevant), etc.  These are all introduced for the convenience of the system, to be implementable in some data modeling scheme (such as relational) and some target DBMS, and to reflect storage efficiency and performance goals.  Note that none of these have anything to do with the users’ view in a data model, and they should not be reflected in that first stage model (some call this the “conceptual” data model).  Here the notion of conceptual should not be confused with a high-level or abstract presentation of a data model.  While we may present a data model as a high-level abstraction, the underlying model needs to be as detailed as necessary to capture as much as possible of the exogenous semantics of the users’ world.  That model should have no consideration of, or constraints relating to, physical implementation.
Back to the question.  The implemented database schema (in some target DBMS) may have some additional elements defined in a way that is peculiar to that system, but need not be expressed in the physical data model.  While the physical data model is a prelude to implementation it is primarily a vehicle for communication with people, to understand the data model.  The physical data model can include generic elements of a physical representation but not in the way which might be represented in the implemented schema. 
Of course, it is important to document (for people) the transformations (mappings) made from one stage of a data model to the next, and the rationale for those choices.  This documentation is key.  From the discussion so far, people seem to be assuming that all you have are the two data models, and the issue is to what extent are they different or should they be the same.  If you want them to be the same, then I would argue for two physical data models - one generic, and one tailored for the target DBMS implementation.  However, this is more work and probably not very useful.  So keep them different, and document the differences.  The tool (DBMS or data modeling tool) should have some ability to produce a human readable representation of the stored schema, even in a graphical form.  We need both logical/physical models and “logical maps.”
We should note here that data modeling schemes and DBMSs have their limitations (often chosen for efficiency reasons).  Hence, we need to transform or fudge or introduce (sometimes spurious) constructs to allow certain semantics to be represented in the next stage model.  For example, to transform an ER model to a relational model we must impose the “single atomic value” constraint on attributes (1NF), which means we must resolve all M:N relationships into two 1:M relationships with an intersection entity.  Also, we must represent all relationships with foreign keys, which is another reason why we cannot represent M:N relationships in a relational model.  None of this is necessary in the users’ conceptual model.  People don’t have a problem understanding M:N relationships!  Only a relational DBMS does!  One approach to implementing complex user constraints which are not expressible in the target DBMS is to embody them in triggers and stored procedures, or worse yet, in application code.  One example might be enforcing referential integrity, if that feature is not available in your target DBMS.   Now the semantics really get buried and difficult for users to comprehend.
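As a minimal sketch of that transformation (with an invented Student/Course example, using sqlite3 only as a stand-in): the M:N "enrolls in" relationship cannot be carried directly in a relational schema, so it is resolved into two 1:M relationships via an intersection table holding the foreign keys.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);

-- The M:N 'enrolls in' relationship becomes an intersection table with two
-- foreign keys: many enrollments per student, many per course.
CREATE TABLE enrollment (
    student_id INTEGER REFERENCES student,
    course_id  INTEGER REFERENCES course,
    PRIMARY KEY (student_id, course_id)
);
""")
```

The intersection table exists purely to satisfy the relational scheme; in the users' conceptual model the relationship would simply be stated as M:N.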
To Alan Howlett’s point, if we try to (manually) maintain a PDM and an implemented schema which are intended to be the same, that is simply an invitation to introduce inconsistency.  If they are to be the same then they must be stored as a single data model.  Hopefully you have a tool which can generate user-friendly representations of the model, a graphical representation, or a schema for your DBMS (multiple?).

Logical vs. Physical data model

Jaunine Conradie (former student) asks:
. . Following the differences between logical and physical data models (e.g., a logical model includes many‑to‑many relationships, thus excluding association entities except where attributes are associated with the association entity)...
Here's the question: If the task is to draw a relational model, do I draw a logical or a physical data model?
If the answer is a logical data model, then I expect to not see association entities and thus also not the foreign keys that would have been included in the association entity.
If the answer is a physical data model, then I expect to see every attribute, as it would be implemented in a specific RDBMS (thus, including data types), based on a full analysis of all available attributes and their values.

Everest response:
    First, I think we need to drop the term "logical" model.  All data models are logical, and some have a physical representation by being implemented in some DBMS. The distinction between logical and physical models arose several decades ago in an effort to rise above the then-current data modeling schemes, which were just trying to draw pictures of a database as implemented and stored physically.  Today we have all sorts of "logical" data modeling schemes.
. . What is important is that every data model is done with and conforms to some modeling scheme, whether logical or physical.  So we can have logical models which are relational, network, hierarchical, single flat file, extended network (allowing M:N relationships), ER, Extended ER (which included ternary relationships and sub/supertypes), multidimensional,... and Fact Oriented Models like ORM.
. . Terry Halpin refers to Relational models as "logical" and ORM models as "conceptual."  OK, but don't get hung up on the terminology; it is the semantics that are important, not the syntax, i.e., the choice of words to describe semantic concepts.


. . If the answer is a logical model, then it will be drawn according to some modeling scheme, which may be relational, in which case it includes foreign keys and intersection entities for M:N relationships.  This logical model could include attribute data types, although that information is getting closer to the physical.  Even though we call ORM conceptual, we do specify data types for the objects!
. . If the answer is a physical model we will generally include more information relating specifically to some DBMS, for example, conforming to the naming conventions and the available data types in the DBMS.  I also suggest dropping the term "physical" model, preferring instead "implementation" model.

. . The truth of the matter is that any data model can be represented over a range of stages, from conceptual to physical, and at each stage it conforms to a particular modeling scheme.

On the origin of the fork notation

A student posts: 
    Dr. Everest, I just caught the fact that you developed the "fork" notation.
Good thinking, it is the most clear and simple of all 1:M relationship notations by far!
‑It occurred to me that maybe the profile of a badminton birdie was your inspiration?

Everest response:

    At the time when I did that (in a 1976 paper*, which became chapter 4 in my Database Management book, McGraw-Hill, 1986, §4.3.2, p.132) my idea came from trying to resolve the difference between the then dominant diagraming scheme of an arrow from parent to child (due to Charlie Bachman in a 1969 paper), and Nijssen drawing the arrow in the opposite direction to represent a function.  What I did was move Nijssen's arrow head to the other end of the arc and voila, a fork!
    
*Here is the reference (click on the title to get a copy of the paper):
Gordon C. Everest, "Basic Data Structure Models Explained With A Common Example," Computing Systems 1976: Proceedings Fifth Texas Conference on Computing Systems, Austin, TX, 1976 October 18-19, pages 39-46. (Long Beach, CA: IEEE Computer Society Publications Office).
            Original paper using a notation for relationship characteristics.  Thought to be the first to use what is now called a "fork" to represent multiplicity in a relationship.  An updated version appeared as Chapter 4 in Database Management:  Objectives, System Functions, and Administration, McGraw-Hill, 1986.

2017-12-20

Why number the rules of normalization? Is it helpful?

Chris Date/Ted Codd originally numbered the rules of normalization, so we have 1NF, 2NF, etc. They were defined in a cascading fashion. For example, for a data structure to be in 3NF, it must also satisfy the 2NF rule, which in turn requires satisfying the 1NF rule. This definition has persisted in virtually all introductory textbooks on data management and data modeling, evidently with the authors not applying any critical thinking to the question. The authors just blindly accept what has been written before.


So is it possible for a relation (entity record or table) to have no transitive dependencies (thus satisfying the condition for 3NF) but have a partial dependency (violating 2NF)? Would that situation mean that the record/relation did not satisfy 3NF? Is it helpful to say that? Even more, when we speak of nested relations or object modeling (as in UML, for example), they explicitly violate 1NF by allowing nested repeating groups of data items or attributes in a record/relation. Does that mean that such structures automatically do not satisfy 2NF or 3NF, even if they have no partial dependencies or transitive dependencies?

The rules of normalization should be expressed or named by the condition which causes a violation.
1NF = NO multivalued dependency (or multivalued attribute thus requiring a many-to-many relationship).
2NF = NO partial dependency, i.e., the data item "attribute" is only dependent on part of a composite key.
3NF = NO transitive dependency on an intermediate data item.
...
Then the rules of normalization are independent and can be applied in any order.
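A worked example (with invented data) of why independent rules read more cleanly: the relation below has a partial dependency but no transitive dependency, so we can simply say it violates the 2NF rule and satisfies the 3NF rule, with no cascading definitions needed.

```python
# ORDER_LINE with composite key (order_no, product_no)
order_line = [
    {"order_no": 1, "product_no": "P1", "qty": 3, "product_name": "Widget"},
    {"order_no": 1, "product_no": "P2", "qty": 1, "product_name": "Gadget"},
    {"order_no": 2, "product_no": "P1", "qty": 5, "product_name": "Widget"},
]

# Partial dependency: product_name depends on product_no alone, which is only
# part of the key  -> the "no partial dependency" (2NF) rule is violated.
# Transitive dependency: there is none -- neither qty nor product_name depends
# on another non-key item  -> the "no transitive dependency" (3NF) rule holds.
# Stated as independent rules, the verdict is simply: 2NF violated, 3NF satisfied.
```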