Intro

I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students; some are my responses to posts on the LinkedIn forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to geverest@umn.edu, since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to do that.

2020-03-25

Is data modeling Design or Description?


Ken Evans asks (LinkedIn, Data Modeling, 2020/3/25)

In the book "Data Modeling: Theory and Practice," Graeme Simsion describes the result of his extensive research into the "Design or Description?" question. Graeme's research showed that whilst many people believe that a data modeler's job is to "describe" a reality that is out there, the truth is that data modelers are designers rather than describers.

Everest Responds:

I do not intend to contradict what Simsion said.  If modeling were purely a descriptive activity it would be easy, although we would have differing points of view, different interpretations.  It is the differences in these descriptions that mean it is best considered a "design" activity -- precisely because there are many choices to be made in building the model.  The modeler CHOOSES how to represent that domain in building the model.  So the question remains, how do we know when we have a correct model?  Interesting that Simsion maintains that there is no one correct model, hence a design activity.  I agree with that, which is why I questioned whether the goal of data modeling is to find THE correct model.
Or perhaps one would be open to finding A correct model! However, that still doesn't answer the question: How do we know it is A correct model?

Is a data model a "representation"? ...of what?


Ken Evans asks (LinkedIn, Data modeling, 2020/3/25)
Hmm. Is a data model a "representation"?

EVEREST RESPONDS:

First of all I don't like the phrase "Data Model" (sorry "Ted" Codd).  It suggests that it is a "model of data."  That is misleading to someone outside of our community.  It is only a model of data if we have some data.  Then the model would be an (abstract) "representation" of the data.  For us, a "data" model is a model of some aspects of a domain of interest to a community of users, real (world) or imagined/desired/yet to be built.  It is a model "in data," that is, built using informational constructs, all guided by a modeling scheme.  The modeling scheme tells us what to look for in the domain and how to represent it in the model.  So we identify a population of similar things, give it a label and a definition, and put a box or circle into a diagram to represent that population of things.  We build up or "design" a model with lots of types of things, add relationships among those things, and constraints on those things and relationships.  The modeling scheme tells us how to represent those relationships and constraints in our model.  Graham Witt adds some light to this argument by calling it a "business data model."
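To make that concrete, here is a rough sketch in Python (my own illustration, not any particular modeling tool or scheme) of a tiny model built "in data": populations with labels and definitions, relationships among them, and constraints on both.

# Populations of similar things, each given a label and a definition.
object_types = {
    "Employee": "A person currently employed by the company.",
    "Department": "An organizational unit to which employees are assigned.",
}

# Relationships among those populations.
relationships = [
    {"name": "works in", "roles": ("Employee", "Department")},
]

# Constraints on those things and relationships.
constraints = [
    {"on": "works in", "rule": "each Employee works in at most one Department"},
]

# The modeling scheme tells us these kinds of constructs exist and how to
# use them; the model itself is the content recorded above.
print(len(object_types), "object types,", len(relationships), "relationship(s)")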

2018-05-15

Data is like the corn flakes

In our information systems and IT shops, computers, communication networks, programs, systems, programmers, indeed, all IT/IS personnel, are merely "factors of production." We don't "eat" the factors of production of corn flakes - the bins which store the corn and other ingredients, the machines which roll out the corn, mix the ingredients, and fill the boxes, the belts and pipes that move the product in the warehouse, and even the warehouse itself. No, we eat the corn flakes, just like we "eat" the data. It is the output from our computer information systems which is of value.

2018-05-14

Data must be managed as an asset

by Martha Lemoine, on the Data-centric manifesto.

Everest responds:

Yes, we often say this, but what does it really mean? Look at the definition of Asset in accounting. Organizations spend money. We call these expenditures. The general assumption is that an expenditure benefits the current accounting period (year) because it is basically helping to generate revenue for the same period in which it is spent. So we then call it an expense. However, if it is a relatively large expenditure which benefits future years, say a new building or piece of machinery, then accountants want to capitalize the expenditure and spread it as an expense (depreciation) over multiple years, offsetting the future revenue it helped to generate. A capitalized expenditure is called an asset. Capitalizing an expenditure pushes an expense into the future, thus increasing reported income for the current accounting period or year. .. Now in the case of data, does it benefit future time periods, i.e., help to generate revenue in the future? Of course. So what expenditures are made to collect, store, process, retrieve, maintain, update, etc.? Should they be capitalized? That is the hard part. Nobody capitalizes them, except maybe a firm whose business is selling data, and even then most of the time they don't. So we must be careful calling data an "asset." Nevertheless, we can certainly call it "something of value" which, as you say, must be managed as an asset.
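To make the accounting point concrete, here is a small worked illustration in Python with made-up numbers (the figures and the straight-line depreciation are assumptions for the sketch, not a prescription):

# Hypothetical numbers: a 500,000 expenditure that benefits five years.
expenditure = 500_000
useful_life_years = 5
annual_revenue = 1_000_000

# Treated as an expense: the whole amount hits the current period.
income_if_expensed = annual_revenue - expenditure                 # 500,000

# Capitalized as an asset: spread as depreciation (straight line here)
# over the years whose revenue it helps to generate.
annual_depreciation = expenditure / useful_life_years             # 100,000
income_if_capitalized = annual_revenue - annual_depreciation      # 900,000

print(income_if_expensed, income_if_capitalized)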

2018-03-12

Data Modeling and NoSQL


To the question “How relevant is data modeling in the world of NoSQL?” I give the following answer.
  The main purpose of data modeling is to understand the business, some application domain, some user world.  The model becomes a representation of that world -- the "things" in it, the relationships among those things and any constraints on those things or relationships.  A secondary purpose is to build a database to contain information which pertains to and describes that domain.  Hence, the business data model should not be concerned with issues of physical stored representation, or the transformations/manipulations/constraints which are imposed to facilitate implementation in some data (storage) management system.  That could be a relational DBMS, or a NoSQL tool.  Generally we speak of the model coming first, then the implementation, and finally, the data gets collected and stored according to the model.  However, increasingly the data already exists in some form.  Which leaves us with the task of figuring out what it means, what it represents -- that is, understanding the data as it represents some user domain.  NoSQL tools are often designed to deal with existing data and to process it more efficiently (that may be an oversimplification!).   Either way, you must understand the business in order to make sense of the data. 
BOTTOM LINE:  it is vitally important to understand the principles, methods, and approaches of data modeling which can help you get to an understanding of the data and the business domain it represents.  Whether using traditional database management tools or NoSQL tools you still need to understand the data.

2018-01-02

Data modeling course in a nutshell

A former student asks:

..    I took your class last spring and am currently in talks with some companies for a junior database position.  I find it a little challenging when it comes to talking about exactly what I learned in your class, as we covered a lot of very valuable information. If you have any input for a quick way to describe all of the information I learned, that would be very helpful.

Concerning Data Model Madness: what are we talking about?

Martijn posts (2012/10/11)

..   There have been endless debates on how to name, identify, relate and transform the various kinds of Data Models we can, should, would, must have in the process of designing, developing and managing information systems (like Data warehouses). We talk about conceptual, logical and physical data models, usually in the context of certain tools, platforms, frameworks or methodologies. A confusion of tongues is usually the end result. Recently David Hay made an interesting video (Kinds of Data Models ‑‑ And How to Name Them) in which he tries to resolve this issue in a consistent and complete manner. But on LinkedIn it was immediately questioned whether this is a final or universal way of looking at such models.
..   One of the caveats is that a 'model' needs operators as well as data containers. Only the relational model (and some Fact oriented modeling techniques, which support conceptual queries) defines operators in a consistent manner. In this respect Entity Relationship techniques are formally diagramming techniques. Disregarding this distinction for now we need to ask ourselves if there are universal rules for kinds of Data Models and their place within a model generation/transformation strategy.
..   I note that if you use e.g. FCO‑IM diagrams you can go directly from conceptual to "physical" DBMS schema if you want to, even special ones like Data Vault or Dimensional Modeling. I also want to remark that there are 'formal language' modeling techniques like Gellish that defy David's classification scheme. They are both ontological and fact driven and could in theory go from ontological to physical in one step without a conceptual or logical layer (while still being consistent and complete, btw, so no special assumptions except a transformation strategy). The question arises how many data model layers we want or need, and what each layer should solve for us. There is tension between minimizing the number of layers while at the same time not overloading a layer/model with so many semantics and constructs that its usability is hampered.
..   For me this is governed by the concept of concerns, pieces of interest that we try to describe with a specific kind of data model. These can be linguistic concerns like verbalization and translation, semantic concerns like identification, definition and ontology, data quality constraints like uniqueness, or implementation and optimization concerns like physical model transformation (e.g. Data Vault, Dimensional Modeling), partitioning and indexing. Modeling/diagramming techniques and methods usually have primary concerns that they want to solve in a consistent way, and secondary concerns they can model but do not deem important (yet these are usually important concerns at another level/layer!). What makes this even more difficult is that within certain kinds of data models there is also tension between notation, approach and theory (the N.A.T. principle). E.g. the relational model is theoretically sound, but the formal notation of nested sets isn't visually helpful, while ER diagramming looks good but has little theoretical foundation beneath it.
..   I personally think we should try to rationalize the use of data model layers, driven by concerns, instead of standardizing on a basic 3 level approach of conceptual, logical, and physical. We should be explicit on the concerns we want to tackle in each layer instead of using generic suggestive layer names.
I would propose the following (data) model layers minimization rule:

A layered (data) modeling scenario supports the concept of separation of concerns (as defined by Dijkstra) in a suitable way with a minimum of layers using a minimum of modeling methodologies and notations.

Physical data model governed by business (logical) data model

Jaunine Conradie (student) asks: (email to me on 2012 Aug 13)
    The typical Teradata architecture includes a staging layer (which stages data from the sources in the form in which it exists on the source), an integration layer (3NF integrated data from stage), and a semantic layer (a dimensional view on the integration layer data).
.. We are two modelers on the project. The problem that we have is that the ETL lead is demanding a model change to facilitate data loads and queries.
.. In short, a Subscriber on the network (Vodacom here in South Africa, which is like Sprint or Verizon in the USA) has various Access Methods of various Access Method Types.  Types can for example be Primary (cell phone number), Data (for internet access), etc.
The change being requested by the ETL lead is to host the Primary access method on Subscriber.

This is problematic because:
1. This is inconsistent with the implementation, where some access methods for a subscriber will now be carried on the Access Method Subscriber entity, and some on the Subscriber entity itself.
2. We do not (and should not) carry history on the Subscriber entity.
3. Teradata has specialized index types to facilitate joins (thus negating the argument for denormalizing to improve query performance)


How do we convince the program manager to not let ETL‑related considerations dictate the design of the physical model?
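For readers who want to see the two shapes being argued over, here is a rough sketch in Python dataclasses (entity and attribute names are simplified guesses based on the description above, not the project's actual model):

from dataclasses import dataclass
from datetime import date

# Modelers' design: every access method, Primary included, lives on the
# Access Method Subscriber entity, where history can be carried.
@dataclass
class Subscriber:
    subscriber_id: int

@dataclass
class AccessMethodSubscriber:
    subscriber_id: int
    access_method_type: str   # e.g. "Primary", "Data"
    access_method_value: str  # e.g. the cell phone number
    effective_from: date      # history carried here, not on Subscriber

# ETL lead's request: hoist the Primary access method onto Subscriber itself,
# so some access methods live in one place and some in another.
@dataclass
class SubscriberDenormalized:
    subscriber_id: int
    primary_access_method: str  # no history, and inconsistent with the rest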

Definitions of concepts should start with their supertype. Why not?

Andries van Renssen posts on LinkedIn:
A good definition of a concept should build on the definition of more generalized concepts. This can be done by two parts of a definition:
1. Specifying that the concept is a subtype of some other more generalized concept.
2. Specifying in what respect the concept deviates from other subtypes of the same supertype concept.

Such definitions have great advantages, such as:
‑ The definition of (and the knowledge about) the supertype concept does not need to be repeated, as it is also applicable ("by inheritance") for the defined (subtype) concept.
‑ The definition implies a taxonomy structure that helps to find related concepts that are also subtypes of the same supertype (the 'sister‑concepts'). Comparison with the 'sister‑concepts' greatly helps to refine the definitions. Furthermore, the definitions prepare for creating an explicit taxonomy.
‑ Finding the criteria that distinguish the concept from its 'sister concepts' prepares for explicit modeling of the definitions, which in turn prepares for computerized interpretation and the growth towards a common Formal Language.
--Thus, why not?
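As a small illustration of this genus-plus-differentia style (my own toy example in Python, not from Andries' posting), a class hierarchy makes both parts of such a definition, and the implied taxonomy, explicit:

class Vehicle:
    """A device used for transporting people or goods."""

class Bicycle(Vehicle):
    """A Vehicle (the supertype, whose definition is inherited) that is
    propelled by its rider pushing pedals -- the respect in which it
    deviates from 'sister' subtypes such as Car."""

class Car(Vehicle):
    """A Vehicle that is propelled by an engine and steered with a wheel."""

# The hierarchy is the implied taxonomy, so the distinguishing criteria
# can be compared across the sister concepts.
print(Bicycle.__mro__)   # Bicycle -> Vehicle -> object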

Can we argue more about business rules?


Posted by Kamran Ayub - Thursday, May 6, 2010, 06:24 PM

I think it's interesting; it's too bad we waited so long to argue about it.  Let me say first off: I understand both arguments. I have my own opinions, but I can definitely see both sides to this.  First, I think we should establish some FACTS that we can all agree on:
1. No DBMS currently available (at least for production-oriented apps) can innately and completely express every business rule we need to express. (Src: experience and Everest)
2. No matter which way you slice it, business rules eventually get turned into code somewhere, whether at the DBMS or in application logic. At some point, there's a piece of machine language where the rule is expressed and validated in 1s and 0s.


There, now that we agree on those two points, what exactly are we arguing about?  If you believe Fact 1, then it follows that it is not possible to ever express all our business rules in the DBMS (at this current state in time).
Here's my view: All DBMSs do enforce a limited set of business rules. Thus, if you argue that your application code holds all business rules in one central place, you are wrong, because the DBMS enforces a few of those constraints (mandatoriness/uniqueness/etc.). Most of us end up duplicating these rules to avoid catching exceptions, without thinking about it.
Secondly, you cannot guarantee that there is only a single point of entry to your data. Don't argue with me; there is a sysadmin account somewhere on the system where the data is stored, and at least that account has full authority to change the data. How many of us have gone directly into our SQL databases to fix a data error, enter sample data, or test triggers? Exactly. Besides, let's not forget you had to define that structure somehow! It may be true your business users can't access that data directly, but somebody is the data steward and she has full control, rules or no rules.
By the way, what about unauthorized access? Whoops, guess that hacker didn't care about all your application-enforced business rules! It's fine if you argue that the hacker could completely destroy your system, but you must agree that if your logic is stored in the application, then at some point, due to the nature of TCP/IP, your data is out in the open between your app and your database. That could potentially be tampered with, and your application can't do anything about it.
So now, the argument is not whether it's better to store business rules in application logic, the question is why can't we put those rules in the DBMS itself, innately? The answer to that is simply because no DBMS allows that just yet.  The reason I think people argue for storing rules in the application is simply because there's really no other better alternative right now. If a .NET developer is comfortable with .NET and not with T-SQL, where do you think she'll put her business rules? In the .NET application or some other application gateway (not the DBMS). This is what I heard today in our discussion. We put all our rules in application logic because it's the best central repository we have... when really the best place should be the DBMS. After all, what have we been learning all along in this class? ORM is able to capture many of these rules... and yet, not all. Business rules are there to establish integrity with our data so they should be in the DBMS.
I believe it's useless to argue for or against storing rules in application logic because right now it's impossible to put them all in the DBMS. Even if you wrote stored procs or triggers for everything, that's not enforced by the DBMS innately (triggers are closer to automatic), plus the rules are still in code.  Furthermore, I believe most people put their rules in application code because it's just harder to do it in the DBMS, and because many enterprise or even regular applications can target different storage systems (one application could use XML, MSSQL, MySQL, DB2, or MS Access as repositories). Imagine duplicating all your business rules across those systems... it's much easier to store the rules in the code, isn't it?  Let me be clear: I am not advocating that it is right to store business rules in app logic. I also believe that it is best to put business rules in the DBMS. All I am saying is, what choice do we have at the moment? Until a better DBMS is created, preferably using ORM as the tool to create the DB, we're SOL. Let the discussion begin!

EVEREST RESPONDS:

.. I love it.  I agree with all you have said. I just would not characterize it as "two sides of an argument." Both "sides" must spring from an environment of understanding of the basic concepts. There are two main traps we fall into: 

1. Believing that someday we can reach the ideal state of a perfect world.  Your discourse is filled with those thoughts, e.g., "no DBMS currently available..." (implying that one may be in the future), "at this current state in time" (suggesting that there may come a time), "no DBMS allows that just yet" (but perhaps a future DBMS product might?), "right now it is impossible" (with the implication that in the future it will be), "until a better DBMS is created" (better yes, but never perfect or complete).  "Man" (which includes woman) is imperfect, and therefore, so are his constructions, our DBMSs, our databases, our code, our systems. So try as we might to improve (and we should), we must always be alert to imperfection, weaknesses, failures, attacks, (whether inadvertent or deliberate, intentional) and build in countermeasures.  That is why we back up our data, build in redundancy, and why every government (every locus of power) must have "her majesty's loyal opposition" and the watchful eye of citizens and the media.

2. We tend to use, to favor that which is more familiar to us. If we are a programmer and that is what we know how to do, we will get things done by writing code in our favorite programming language. If we know how to use the full functionality of some DBMS, then that is what we will do as much as we can.  Some application coders need to be stretched to consider an alternative way of doing things that might get us closer to the ideal of serving people (management, organizations) AND incorporating appropriate safeguards (which also will never be perfect). 

.. The business rules and integrity constraints are those mechanisms that can lead to higher quality data, greater user confidence (in our databases, our systems, our IT departments, and our organizations), better decisions, and more effective and efficient operation. As you rightly state, ALL that happens in a computer system is done by executing code. But that is only half the story.  ALL that executing code is driven by specifications, definitions, declarations, input parameters, in other words "data" more broadly interpreted. Even the program code is the specification of a process. The accuracy and completeness of the process depends on the specifications, the directives, etc. 

.. As I show in my lecture on Business Rules (slides BRules.23 and .27), there is a hierarchy of levels where this information, the business rules, can be expressed:
1. DBMS
2. Stored Procedures, whose execution is under the control of the DBMS.
3. Code (methods) embedded (encapsulated) in "objects" which execute in a well-managed and controlled environment (OO, SOA, SaaS...)
4. User application code
5. External user actions (manual or sysadmin)
.. At all these levels the processes are driven by definitions, specifications.  The efficacy of those processes depends on those specifications, and we (humans) understand those processes through documentation which itself is usually quite imperfect.
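As a small illustration of levels 1 and 4 (a sketch only, using SQLite from the Python standard library and a made-up rule that department codes must be unique):

import sqlite3

conn = sqlite3.connect(":memory:")

# Level 1 -- the rule declared to the DBMS: a specification, enforced by
# code we never see, close to the data and centrally controlled.
conn.execute("CREATE TABLE department (code TEXT UNIQUE, name TEXT)")

# Level 4 -- the same rule re-expressed in user application code: more
# dispersed, and easily bypassed by any other path to the data.
def add_department(code, name):
    exists = conn.execute(
        "SELECT 1 FROM department WHERE code = ?", (code,)).fetchone()
    if exists:
        raise ValueError(f"duplicate department code {code!r}")
    conn.execute("INSERT INTO department VALUES (?, ?)", (code, name))

add_department("HR", "Human Resources")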

.. Moving down these levels we encounter decreasing levels of control and increased dispersion. I put DBMS at the top, because that is the closest to the data and the most likely centralized point of control (least dispersion). Our objective is (or should be) to move these boundaries down so that work gets pushed up into more controlled environments. Another thing that happens as we move through these levels is increasing abstraction.  When we declare database integrity constraints, that is a specification (abstraction) of some underlying process (executed by code... somewhere). We (humans) do not need to know what goes on under the covers, we merely need to have confidence that it will be done correctly and consistently, as reflected in the definition or specification.  For example, we may understand the meaning of the square root operator, but how it gets executed on a computer is quite complex. I want to be hidden from that complexity.  When we write a program we call all sorts of subroutines or functions, which are defined at a pretty high level (of abstraction).

.. To the comment of "too bad we waited so long": The focus of my course is data modeling. The subject of a companion course would be DBMS and Database Administration. This issue we are discussing would properly belong in the second course.  However, I raise the question here because the success of maintaining the integrity of our databases according to our defined semantics depends upon the execution environment, the DBMS ++.

.. Let the discussion continue.

2017-12-22

Natural or surrogate keys – selection criteria


In a post on the Data Modeling forum in LinkedIn (2012/05), Bryan had 2 criteria for picking a key, Tony added a third and Kim had 4. Let me give you my nine which I put forth in my advanced database design class at the University of Minnesota (see http://geverest.umn.edu):

CRITERIA: (conflicts necessitate compromise)
* Unique ‑ Guaranteed!
* Ubiquitous ‑ have a value for every entity instance, ever. In the US the Social Security number was established in 1935 but we are now running out.  As another note, the SSA had no mandate to ensure that it was unique until many decades later ‑‑ not bad with a 75 year run!
* Unlimited ‑ won't ever run out of values; reuse?
* Unchanging, never changes ‑ "immutable"
* Under your own control ‑ manage the values/codes, or at least some independent body exists to assign and maintain codes to ensure uniqueness. E.g., the Universal Postal Union (http://www.upu.int) maintains the official names of all countries in the world. New countries must apply for and have approved a name (or a change from an old name), or else they will be unable to exchange mail with other countries.
* Used by people in the user environment, if not, can you come up with one that satisfies the next criterion?
* 'Mnemonic' (easy for people to recognize/remember/generate)
* "Dataless" ‑ sole purpose is to identify ‑ should carry no other information; don't overwork an identifier
* Compact, easy and efficient to process and store ‑ often means 'numeric' and 'fixed length' (secondary)
> else, invent a surrogate (“Autonumber”) ‑ only if none exists naturally
Anyone have a 'U' word for any of the last three???

These criteria will usually get you close. If you can't get close, then take the 'else.'  It is important to distinguish the need for an identifier for users as expressed in a business-oriented data model, from the model to be implemented in some DBMS. These two purposes are somewhat at odds with each other.  If you can't come up with a suitable natural key (according to the above criteria, and some compromise is usually necessary) then use a surrogate key. Then some sort of lookup would be needed for the users to find the surrogate key based upon the best information they have when doing a search. For almost any sort of natural key, you can nearly always find examples which make things difficult: e.g., we run out of numbers; a state changes the name of a county (so any number ID assigned based upon alphabetical order becomes invalid); or we change characteristics on which the ID was assigned (e.g., should color be in the unique ID for an appliance? It usually is, but I can paint my refrigerator ‑‑ color is an attribute, and should not be part of the ID).
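Here is a minimal sketch of the 'else' branch (invented names in Python; any real autonumber or UUID mechanism would serve): a surrogate is generated only because no natural key meets the criteria, and a lookup on the attributes users actually know gets them to it.

import itertools

_next_id = itertools.count(1)     # stand-in for an "Autonumber" column

appliances = {}                   # surrogate key -> descriptive attributes

def add_appliance(maker, model, serial_no, color):
    key = next(_next_id)          # unique, unlimited, unchanging, dataless
    appliances[key] = {"maker": maker, "model": model,
                       "serial_no": serial_no, "color": color}
    return key

def find_appliances(**known):     # users search on whatever they do know
    return [k for k, row in appliances.items()
            if all(row.get(attr) == val for attr, val in known.items())]

fridge = add_appliance("Acme", "CoolMax", "SN-123", "white")
print(find_appliances(maker="Acme", color="white"))   # -> [1]
appliances[fridge]["color"] = "red"   # repainting changes an attribute, never the key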

One more comment. When we speak of having an identifier, it will only be unique within a defined and finite name space, that is, a particular context. If we go out of that name space we need to qualify the ID with an additional identifier of the expanded context, ad infinitum. 

TABLE THINK - multiplicity (manyness) in a relationship?

Suppose we have a relational table - [ X | A | B | ... ] where A and B are attributes of X.
Two common confusions:


(1)  First, we can say that there IS a relationship between X and A. Since we observe that for each X there is only one A (1NF), but for each A there could be many Xs, we may be led to say there is a one‑to‑many relationship between X and A.  However, that relationship is actually many Xs to one A -- many to one, not one to many, as stated.  When you say one to many (between X and A), I must apply the terms respectively; otherwise, confusion results from the ambiguity and inconsistency.  So the relationship is many to one between X and A, respectively.


(2) We observe that for each value of A there could be multiple different values of B in the table of X, and vice versa.  Does that mean there is a many‑to‑many relationship between A and B?  Any discussion of oneness or manyness must presume there is a relationship.  When you say there is a M:N relationship between A and B, then I must be able to list all the valid pairwise associations, as in a table with two columns.  And being M:N, the same value of A can appear in multiple rows and the same value of B can appear multiple times.  Of course the exact same pair of A and B values on a single row could never appear more than once, since it would be the same instance of the relationship.
Now for any given X, there is a value for A and a value for B.  Would that particular value of A be the one that is related to that value of B?  NO.  Furthermore, could the same pair of A and B values appear on multiple rows in X?  YES.  So the juxtaposition of A and B in X does not represent a relationship between A and B.  It is spurious.  IF there is a relationship between A and B, it must be represented in another table, and could be 1:1, 1:M, M:1 or M:N.  But since we said that we know everything about X, A, and B and it is represented in the diagram, then there must not be a relationship between A and B.  Again, you can look at the table of X and see that there are multiple different values of A appearing beside the same value of B and vice versa.  However, that in no way means there is a relationship between A and B represented in the table of X.
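Here is a small worked illustration of that point, with made-up values:

# A table of X with attributes A and B (each row is one X instance).
X = [
    {"X": 1, "A": "a1", "B": "b1"},
    {"X": 2, "A": "a1", "B": "b2"},   # same A beside different Bs
    {"X": 3, "A": "a2", "B": "b1"},   # same B beside different As
    {"X": 4, "A": "a1", "B": "b1"},   # the same (A, B) pair repeats
]

# Projecting the (A, B) pairs out of X does NOT give a relationship table:
pairs = [(row["A"], row["B"]) for row in X]
print(pairs)        # duplicates appear, e.g. ('a1', 'b1') twice
print(set(pairs))   # and nothing says these co-occurrences are meaningful

# If A and B really were related, that relationship would be populated in
# its own two-column table, independently of X -- hypothetical here:
AB = {("a1", "b1"), ("a2", "b2")}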

Physical Data Model vs. Implemented Database Schema ‑ To what degree should they match?

By: Brent (LinkedIn discussion, started 2011 Nov. 9)
. . We're having an internal discussion about PDM standards. The existing standard says that, for all tables, columns, keys, indexes, and relationships, the PDM should match what was implemented in the physical database. We have a huge Teradata database that has no primary key constraints, foreign key constraints, or indexes implemented in the database, but they are still in the data model. Those components are part of the application code used to load the tables, so the total database implementation = database schema + data load rules, and the standard is satisfied. Agree? Disagree? What's your standard?

EVEREST RESPONDS:


First recognize that a data model goes through several stages from a detailed “conceptual” representation of the users’ world to final implementation in some DBMS.  This question suggests there are at least two stages.  Ideally, as the data model is transformed from one stage to the next, the (user) semantics are retained (perhaps with differing syntactical representations), but additional features and constructs may be added.  In general, the model at each stage will be different than the previous stage.  So I would answer that they need not and should not match, except in the semantics of the users’ world. 
Note that user semantics do not include things such as clustering data items into records/tables, converting the representation of relationships into foreign keys, adding indexes or surrogate keys, enforcing first normal form (other normal forms are still relevant), etc.  These are all introduced for the convenience of the system, to be implementable in some data modeling scheme (such as relational) and some target DBMS, and to reflect storage efficiency and performance goals.  Note that none of these have anything to do with the users' view in a data model, and they should not be reflected in that first-stage model (some call this the "conceptual" data model).  Here the notion of conceptual should not be confused with a high-level or abstract presentation of a data model.  While we may present a data model as a high-level abstraction, the underlying model needs to be as detailed as necessary to capture as much as possible of the exogenous semantics of the users' world.  That model should have no consideration of, or constraints relating to, physical implementation.
Back to the question.  The implemented database schema (in some target DBMS) may have some additional elements defined in a way that is peculiar to that system, but need not be expressed in the physical data model.  While the physical data model is a prelude to implementation it is primarily a vehicle for communication with people, to understand the data model.  The physical data model can include generic elements of a physical representation but not in the way which might be represented in the implemented schema. 
Of course, it is important to document (for people) the transformations (mappings) made from one stage of a data model to the next, and the rationale for those choices.  This documentation is key.  From the discussion so far, people seem to be assuming that all you have are the two data models, and the issue is to what extent are they different or should they be the same.  If you want them to be the same, then I would argue for two physical data models - one generic, and one tailored for the target DBMS implementation.  However, this is more work and probably not very useful.  So keep them different, and document the differences.  The tool (DBMS or data modeling tool) should have some ability to produce a human readable representation of the stored schema, even in a graphical form.  We need both logical/physical models and “logical maps.”
We should note here that data modeling schemes and DBMSs have their limitations (often chosen for efficiency reasons).  Hence, we need to transform or fudge or introduce (sometimes spurious) constructs to allow certain semantics to be represented in the next-stage model.  For example, to transform an ER model to a relational model we must impose the "single atomic value" constraint on attributes (1NF), which means we must resolve all M:N relationships into two 1:M relationships with an intersection entity.  Also, we must represent all relationships with foreign keys, which is another reason why we cannot represent M:N relationships in a relational model.  None of this is necessary in the users' conceptual model.  People don't have a problem understanding M:N relationships!  Only a relational DBMS does!  One approach to implementing complex user constraints which are not expressible in the target DBMS is to embody them in triggers and stored procedures, or worse yet, in application code.  One example might be enforcing referential integrity, if that feature is not available in your target DBMS.  Now the semantics really get buried and difficult for users to comprehend.
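A small sketch of the M:N transformation mentioned above (generic names, not from any particular model): an M:N relationship between Student and Course, which people understand directly, becomes two M:1 relationships through an intersection entity carrying foreign keys.

# Conceptual view: a many-to-many relationship, stated directly.
enrolled_in = {("alice", "DB101"), ("alice", "ORM201"), ("bob", "DB101")}

# Relational implementation: 1NF forces atomic values and foreign keys, so
# the M:N relationship is resolved into an intersection table having two
# M:1 relationships back to Student and Course.
students = {"alice": {"name": "Alice"}, "bob": {"name": "Bob"}}
courses = {"DB101": {"title": "Databases"}, "ORM201": {"title": "Fact Modeling"}}
enrollment = [                            # the intersection entity
    {"student_id": s, "course_id": c}     # each column is a foreign key
    for (s, c) in sorted(enrolled_in)
]
print(enrollment)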
To Alan Howlett’s point, if we try to (manually) maintain a PDM and an implemented schema which are intended to be the same, that is simply an invitation to introduce inconsistency.  If they are to be the same then they must be stored as a single data model.  Hopefully you have a tool which can generate user-friendly representations of the model, a graphical representation, or a schema for your DBMS (multiple?).

Logical vs. Physical data model

Jaunine Conradie (former student) asks:
. . Following the differences between logical and physical data models (e.g., a logical model includes many‑to‑many relationships, thus excluding association entities except where attributes are associated with the association entity)...
Here's the question: If the task is to draw a relational model, do I draw a logical or a physical data model?
If the answer is a logical data model, then I expect to not see association entities and thus also not the foreign keys that would have been included in the association entity.
If the answer is a physical data model, then I expect to see every attribute, as it would be implemented in a specific RDBMS (thus, including data types), based on a full analysis of all available attributes and their values.

Everest response:
    First, I think we need to drop the term "logical" model.  All data models are logical, and some have a physical representation by being implemented in some DBMS. The distinction between logical and physical models arose several decades ago in an effort to rise above the current data modeling schemes which were just trying to draw pictures of a database as implemented and stored physically.  Today we have all sorts of "logical" data modeling schemes.
. . What is important is that every data model is done with and conforms to some modeling scheme, whether logical or physical.  So we can have logical models which are relational, network, hierarchical, single flat file, extended network (allowing M:N relationships), ER, Extended ER (which included ternary relationships and sub/supertypes), multidimensional,... and Fact Oriented Models like ORM.
. . Terry Halpin refers to Relational models as "logical" and ORM models as "conceptual."  OK, but don't get hung up on the terminology; it is the semantics that are important, not the syntax, i.e., the choice of words to describe semantic concepts.


. . If the answer is a logical model, then it will be drawn according to some modeling scheme, which may be relational, in which case it includes foreign keys and intersection entities for M:N relationships.  This logical model could include attribute data types, although that information is getting closer to the physical.  Even though we call ORM conceptual, we do specify data types for the objects!
. . If the answer is a physical model we will generally include more information relating specifically to some DBMS, for example, conforming to the naming conventions and the available data types in the DBMS.  I also suggest dropping the term "physical" model, preferring instead "implementation" model.

. . The truth of the matter is that any data model can be represented over a range of stages, from conceptual to physical, and at each stage it conforms to a particular modeling scheme.

On the origin of the fork notation

A student posts: 
    Dr. Everest, I just caught the fact that you developed the "fork" notation.
Good thinking, it is the most clear and simple of all 1:M relationship notations by far!
‑It occurred to me that maybe the profile of a badminton birdie was your inspiration?

Everest response:

    At the time when I did that (in a 1976 paper*, which became chapter 4 in my Database Management book, McGraw-Hill, 1986, §4.3.2, p.132) my idea came from trying to resolve the difference between the then dominant diagramming scheme of an arrow from parent to child (due to Charlie Bachman in a 1969 paper), and Nijssen drawing the arrow in the opposite direction to represent a function.  What I did was move Nijssen's arrow head to the other end of the arc and voila, a fork!
    
*Here is the reference (click on the title to get a copy of the paper):
Gordon C. Everest, "Basic Data Structure Models Explained With A Common Example," Computing Systems 1976: Proceedings Fifth Texas Conference on Computing Systems, Austin, TX, 1976 October 18-19, pages 39-46. (Long Beach, CA: IEEE Computer Society Publications Office).
            Original paper using a notation for relationship characteristics.  Thought to be the first to use what is now called a "fork" to represent multiplicity in a relationship.  An updated version appeared as Chapter 4 in Database Management:  Objectives, System Functions, and Administration, McGraw-Hill, 1986.

2017-12-20

Why number the rules of normalization? Is it helpful?

Chris Date/Ted Codd originally numbered the rules of normalization, so we have 1NF, 2NF, etc. They were defined in a cascading fashion. For example, for a data structure to be in 3NF, it must satisfy the 2NF rule, which in turn requires that it satisfy the 1NF rule. This definition has persisted in virtually all introductory text books on data management and data modeling, evidently with the authors not applying any critical thinking to the question. The authors just blindly accept what has been written before.


So is it possible for a relation (entity record or table) to have no transitive dependencies (thus satisfying the condition for 3NF) but have a partial dependency (violating 2NF)? Would that situation mean that the record/relation did not satisfy 3NF? Is it helpful to say that? Even more, when we speak of nested relations or object modeling (as in UML, for example), these explicitly violate 1NF by allowing nested repeating groups of data items or attributes in a record/relation. Does that mean that such structures automatically do not satisfy 2NF or 3NF, even if they have no partial dependencies or transitive dependencies?

The rules of normalization should be expressed or named by the condition which causes a violation.
1NF = NO multivalued dependency (or multivalued attribute thus requiring a many-to-many relationship).
2NF = NO partial dependency, i.e., the data item "attribute" is only dependent on part of a composite key.
3NF = NO transitive dependency on an intermediate data item.
...
Then the rules of normalization are independent and can be applied in any order.
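A small sketch of why the rules stand on their own (toy data and a generic dependency check in Python, not from any textbook): a table can be free of transitive dependencies yet still contain a partial dependency, and each condition can be tested without reference to the other.

def holds(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in lhs)
        val = tuple(r[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

# Composite key is (order_no, product); warehouse depends on product alone.
rows = [
    {"order_no": 1, "product": "P1", "warehouse": "W1", "qty": 5},
    {"order_no": 2, "product": "P1", "warehouse": "W1", "qty": 3},
    {"order_no": 2, "product": "P2", "warehouse": "W2", "qty": 9},
]

partial = holds(rows, ["product"], ["warehouse"])    # dependency on part of the key
transitive = holds(rows, ["warehouse"], ["qty"])     # non-key -> non-key dependency
print("partial dependency present:", partial)        # True  (a 2NF violation)
print("transitive dependency present:", transitive)  # False (no 3NF violation here)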

Can we avoid the need for normalization entirely?

Normalization is the Achilles heel of data modeling. Even experienced data modelers get it wrong sometimes. Wouldn't it be nice if we had a data modeling scheme which did not require the modeler to think about and apply the rules of normalization? How might that be possible?

Any form of record-based modeling (e.g., Relational, ER, IDEF1X, UML...) requires the enforcement of normalization rules.  In record-based modeling, data items are clustered together in records (or tables) to describe a particular entity type, the data items are then called "attributes" of the entity.  The problem is that sometimes we put attributes into a table when we shouldn't.  The rules of normalization help us to identify which attributes violate the rules, and hence must be removed from that table (and probably put into some other, often new, table).

Unfortunately, this is necessary but not sufficient for a good design.  The rules of normalization are only applied to a single record/table at a time.  There may be inter-record relationships which require further examination to ensure a good design.  Another problem is:  you must first determine or designate the identifier before you can apply the rules of normalization.  That is because normalization is all about what "determines" each attribute.  To paraphrase Chris Date (which has been widely recounted) "every data item (in a record) must be single valued (for each value of the identifier key) [1NF], must be functionally dependent on the key, the whole key (if  composite) [2NF], and nothing but the key (i.e., directly dependent with no transitive dependencies) [3NF] ... so help me Codd!"  

Normalization is the test, and record decomposition is the remedy for violations (i.e., removing data items from the record).  The need for normalization can be avoided if we don't cluster data items into a record in the first place.  That is the essence of fact (oriented) modeling, such as ORM.  The inverse of functional dependency is determinancy.  In a normalized record, the determinant of an attribute is the key or identifier.  In fact modeling, we directly relate each data item with its determinant.  That then gives us the unambiguous way to store data items into records or tables without violating any of the rules of normalization.  In a sense, a fact model is the ultimate decomposition of a (record-based) database, such that each record now has at most one non-key domain.  Some have called this the 6th NF, but beware, there are other definitions of 6NF.

So the answer to your question is yes... if we design our data models using fact modeling we can avoid the need for normalization entirely!
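A rough sketch of that decomposition (an invented example in Python): instead of clustering attributes into one Employee record and then testing it against the normalization rules, each data item is stored directly with its determinant.

# Record-based: attributes clustered into one table keyed by employee id.
employee_record = {
    101: {"name": "Ann", "dept": "D1", "dept_name": "Sales"},   # dept_name's
    102: {"name": "Bob", "dept": "D1", "dept_name": "Sales"},   # determinant is dept
}

# Fact-based: one elementary fact per data item and its determinant, so
# there is nothing left for the normalization rules to catch.
employee_name = {101: "Ann", 102: "Bob"}   # Employee determines name
employee_dept = {101: "D1", 102: "D1"}     # Employee determines department
dept_name     = {"D1": "Sales"}            # Department determines its name

# Regrouping facts by common determinant reproduces normalized tables.
print(employee_name[101], dept_name[employee_dept[101]])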

Five rules of normalization in one

Do you think that it is possible to represent the five rules of normalization in a single rule? 
Well, here it is:
"Store each data item with its determinant as the identifier in a record/table."
By data item I mean any data element consisting of a domain of values (which are surrogates of things in the real world), which are members of a population called by the name of the data item/element.
Determinant is based on a relationship between two populations.  If the members of one, say Y, are functionally dependent on the other, say X, then we say that X determines Y.  The functional dependency can be expressed as Y = Fn(X).   For example, if we had department as one population and employees as another, and we assume that an employee can work in at most one department (at a time), then there is a functional relationship between them:  Dept = Fn(Employee), that is, given a member of the employee population (ID) that determines a unique value for Dept, hence a function.  If there cannot be a Y without an X, then we have a functional dependency.

In a side note, the relationship could involve a single population, in which case we call it a reflexive relationship.  An example would be "is boss of" on an employee population.

Why isn't ORM more popular?

2010 December on Linked In Adam Ralph asked:

About 4 years ago, I was tasked with designing a new data centric application. My first instinct was to draw a conceptual model. For previous projects, I'd used entity relationship modelling but I was keen to find out if there were any alternative approaches. Scouring the internet, I stumbled across ORM. I printed out Terry Halpin's article "Object Role Modeling: An Overview" and studied it over the next couple of days. The approach seemed rather compelling, especially given the natural language aspect. So, I worked my way through CSDP for part of the UoD for the application. The model which presented itself, with the automatic verbalization and relational model translation was so elegant that I have never looked back.

ORM has now become my starting and reference point for most of the work I do. Yet I'm constantly surprised that no‑one else I encounter has even heard of it. After four years, including a change of company (from one large blue chip to another), I still have not encountered a single person who has heard of ORM, let alone practiced it.

I believe ORM is by far the most expressive and natural data modeling technique currently available, so the question arises, why isn't it more popular?

EVEREST RESPONSE:

I am often asked this question.  It is addressed in my Advanced Database Design course which teaches ORM. Let me give you some reasons from my perspective.  Some reasons stem from the early years, others relate to the present state of data modeling practice.

In the early years in the Evolution of NIAM/ORM:

(1) CDC kept it proprietary.
(2) Few academics were involved (who need to publish to keep their jobs and get promoted!)
(3) Lack of strong vendor support
(4) Lack of good papers, text book, and inexpensive, viable tool to support NIAM.

ORM has its roots in the work of Sjir Nijssen, with later notable contributions by Terry Halpin.  In 1975 August in Germany I had the privilege of teaching a course with Sjir.  At that time we reviewed a couple of draft papers he had prepared on the work he was doing on the Kadaster project in the Netherlands while working for Control Data Corporation (CDC). The papers were published in 1976 and he called it “binary modeling.“  CDC closely guarded the method as valuable intellectual property and began offering Information Analysis services to clients (for a fee).  This became known as the Nijssen Information Analysis Method (NIAM).  Most of the papers were published in obscure journals so did not get out to a wide audience (unlike the seminal paper Ted Codd published in 1970 on the relational model of data).  CDC developed an automated tool to support NIAM, as did others in those earlier years but they did not become popular.  Most of the activity was in Europe with little exposure in the USA, except perhaps for my course in Advanced Database Design at the University of Minnesota which I have taught since 1971.  I came back from the 1975 encounter with Sjir convinced of the superiority of the modeling method over relational, ER, or CODASYL network modeling.  I was so excited that I incorporated it into my class. However, it was little more than a lecture or two.  There were no really good papers, no text book, and no viable tool. 

Then Sjir Nijssen went to the University of Queensland in Australia in 1983 as a visiting chaired professor.  There he hooked up with Terry Halpin, who saw the superiority of NIAM.  Together they published the first comprehensive treatment of the modeling scheme, which was renamed Object Role Modeling (ORM), in 1989. Then some others saw it as a better way to do data modeling and developed an ORM modeling tool (InfoDesigner from Serverware, which became InfoModeler at Asymetrix, then went to Visio, and finally to Microsoft).  Finally we had a tool and a textbook.  I began teaching predominantly ORM in my Advanced Database Design class in the early 1990s.  Today it is available wholly online.  Go to http://geverest.umn.edu for more information.  The course is offered only in the spring, and the 2011 offering begins on Jan 20.

(Terry and Sjir can validate and correct the information above).

Reasons for ORM's lack of popularity stemming from Current Data Modeling Practice - PART 1

() ORM is very difficult to grasp on your own by reading a book, without good training.  Adam, you are to be commended for taking the time and having the persistence to really dig into and understand ORM.  My course provides the necessary knowledge and skill to use ORM and its data modeling tool, NORMA.
  
() Blind acceptance in the industry that ER/relational data modeling is the best way to do data modeling.  Practicing data modelers are always thinking in terms of ER diagrams as their data models, or worse, relational tables.  I call this "Table Think" and the malady is "Tableitis."


() Failure to recognize the problems inherent in the use of any record-based data modeling scheme, that is, one based upon the formation of entity records or objects with a set of attributes.  Normalization is the way to detect and resolve some of these inherent problems.  Unfortunately, it is practically and theoretically impossible for any DBMS or data modeling tool to help the designer produce normalized structures, unless... they use ORM.  ORM does not distinguish the notion of entities/objects from that of attributes; everything is a domain or population of something (called an object, which is distinctly different from an object in object-oriented modeling).  Objects have attributes by virtue of the roles they play in relationships with other objects.  Hence, ORM has two constructs -- objects and roles (in relationships).

() Most practicing data modelers think they are doing a good job.  They do not realize or consider that there may be a better way.  My course is a hard sell to experienced data modelers.  But give me a chance to talk to them and it at least opens up the possibility of a better way.  I have had some skeptics start my course, but they soon get converted because ORM is so compelling.  I have even had some that go to the end of the class unconvinced, and come back to me months or years later to say they finally got it!  I thank former students, Kevin and Troy, for their postings on this discussion group.

() nearly all textbooks on database management or database design present only record-based data modeling schemes.  So we continue to perpetuate the myth that this is the way to go.

Reasons for ORM's lack of popularity stemming from Current Data Modeling Practice - PART 2

() Most published research studies comparing various data modeling schemes including ORM report that it is inferior.  Upon closer examination, we discover that the experiments were conducted by people who had incomplete knowledge and skill in ORM.

() ORM is often criticized as being bottom-up and too detailed for conceptual modeling.  We hear this from both proponents and opponents of ORM.  So we have subject area diagrams, or high-level ER diagrams.  That just means some of the underlying semantics are hidden in the presentation.  The attributes are still there, even if we only show entities and relationships.  Being too detailed or not is a matter of presentation, not modeling.  The goal of (conceptual) data modeling is to capture as rich a set of semantics as possible of the users' world being modeled.  But it is foolish to present it all at once (are your conference room walls papered with diagrams to show all the entities and attributes in your enterprise data model?).  Since humans have cognitive limitations, we present various abstractions of the model, eliminating parts or detail, to facilitate human understanding.

() All the major DBMSs are based on the relational data model(ing scheme) so we often model in tables.

() The annual ORM workshop held under the OnTheMove conferences does not get beyond a small group of very dedicated enthusiasts.

() Lack of vendor support.  With the acquisition of Visio in 1999, Microsoft got the InfoModeler ORM tool.  They added it to the enterprise team editions of Visual Studio as a version of Visio (not the same as the Visio in the Office Suite).  Unfortunately, Microsoft has done nothing with the ORM tool.  It has essentially the same functionality as Visio got when they acquired InfoModeler 3.1.  Microsoft has not enhanced the functionality of the tool and provides no training in ORM or the use of the Visio tool.  Fortunately, there is a better alternative tool now available called NORMA -- open source, running as a plug-in to Visual Studio.  With a valid ORM model, just push a button and NORMA generates a relational database guaranteed to be in fifth normal form, ready for direct input as a data definition to any one of several popular DBMSs.  Nevertheless, NORMA still has much maturing to do.  It needs additional enhancement to become an industrial-strength data modeling tool, and that is in the works at LogicBlox.

That was probably more reasons than you wanted.  If you desire a taste of or motivation for ORM, you can attend my workshop entitled "Rethinking How We Do Data Modeling" at the DAMA Enterprise Data World conference (edw2011.wilshireconferences.com) in Chicago, 2011 April 3-7.  I will also be giving a presentation entitled "Helping Business Users Understand Your Data Models."  More information is at my website: http://geverest.umn.edu . So the challenge is: how do we get the word out?

Value Object Type vs. Entity Object Type

A student asks:
... I'm still somewhat confused about when to use an entity type object and when to use a value type object. Let's take an example. Suppose we have two related columns of data:

Day Code   Day Name
Su         Sunday
M          Monday
T          Tuesday
W          Wednesday
Th         Thursday
F          Friday
Sa         Saturday
    It seems that there might be two ways to represent such a 1:1 relationship. We might have a fact statement with two entity type objects, written as:            
DayCode(.code) has DayName(.name)
or an entity type object and a value type object, written as:       
DayCode(.code) has DayName()
In either case, the predicate constraints could be unique/mandatory on both objects.  While both fact statements seem to have the same results, what are the advantages and disadvantages of using one or the other in such a situation?

EVEREST RESPONSE: 

Good question. I refer you to Halpin [2008], pages 10, 65, and 95. He doesn't say much about this, but perhaps enough to understand the idea. A value object type is a constant. It is self-identifying, that is, when you see the value written down, you (the user community) always know what is being referred to. So in your example, we always know what day of the week is referred to by either your DayName or DayCode. Each could be a value object type. In each of these two examples, what you really have is merely a string constructed from some alphabet. There is another population here, and that is the "days of the week." That would be a non-lexical object type (NOLOT). Then we can use a name or code to refer to the day of the week. We will pick one of them to be the reference mode for day of the week. And if that is always sufficient and unambiguous, we can make it a value object type.
.. Let me take you through the thinking to answer your question. Everything we want to include in the database can initially be thought of as a NOLOT. (See slide# ORMod.42, which is essential to understanding this distinction, which is made in slide 43.) Then we can ask, what are the different ways of referring to those things? Each of those would be a separate object type. If we pick one of those, the day of the week can be referenced directly (with the name or code) and hence there is no need for some other scheme for identifying or referencing days of the week. Of course, the choice of identifier must be unique, ubiquitous, etc. (as discussed in class, slide# ERel.38).
.. NOTE that a value object type can also be a number. That includes all the number types we talked about in class. Also note that the string of numeric digits is actually referring to some point on a scale (that is the NOLOT), which can have various arithmetic properties (cardinal, interval, ordinal, nominal). We have to know what the base is in determining the number from the string of digits. For example, '10' is a very different number (the NOLOT) depending on the base, be it binary, or octal, or decimal. The representation of the number will be some chosen lexical object, constructed from some naming scheme. Again, in a particular context, the string of digits may be sufficient to always, and unambiguously, refer to (represent) the number value, for all the users in that community. Hence, we could make the number a value object type.
.. As you can see, this is not always hard and fast, depending on the user context. For example, we may be interested in how days of the week are referenced in other languages. Further, we may want to relate these to the days in some calendar other than the Gregorian calendar, which is used by most of the English-speaking world. So you see, we really have to go back to thinking of the day (no longer "of the week," since other calendars may not have the concept of a week) as a NOLOT, apart from how it is referenced. The value object type is only used when values in a particular referencing scheme are sufficient and always unambiguous to refer to members of the NOLOT population. The NOLOTs are always there in the world being modeled. The naming and reference schemes are our own constructions. The conundrum is that we have no way to refer to members of the underlying NOLOT population (the semantics) without using the lexical handle (the syntax). As we said last week, we can only see or represent the semantics through the syntax.
.. What is the advantage? The use of value types can simplify our model. We can let the value itself (the lexical representation) be the object. It is sometimes a convenient shorthand.
.. Again, thank you for asking the question. It has given me an opportunity to think it through more clearly, I hope. Sorry for being so long winded.
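To make the distinction concrete, here is a minimal sketch in Python (my own construction, not ORM notation or the NORMA tool): the day of the week itself is the non-lexical population, reachable only through some lexical reference scheme; if the code alone is always sufficient and unambiguous for the user community, the lexical value itself can stand in as the object.

from dataclasses import dataclass

# Entity (non-lexical) object type: the day of the week itself, referenced
# through a chosen reference mode (here, the code).
@dataclass(frozen=True)
class DayOfWeek:
    code: str                  # e.g. "Su"

day_name = {                   # fact type: DayOfWeek has DayName
    DayOfWeek("Su"): "Sunday",
    DayOfWeek("M"):  "Monday",
}

# Value object type: the lexical value ("Su") is taken as the object itself,
# a convenient shorthand when it is always unambiguous in this context.
day_name_by_code = {"Su": "Sunday", "M": "Monday"}

print(day_name[DayOfWeek("Su")], day_name_by_code["Su"])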