Fabian Pascal posts (LinkedIn, Data Modeling, 2020/03/29)
Shown a data model diagram, he says "Actually, it is not a logical model, but a graphical representation of it that users understand -- the DBMS doesn't. Users understand the conceptual model SEMANTICALLY; the DBMS 'understands' the true logical model ALGORITHMICALLY, and that's not what your drawing is."
Everest responds:
Regarding model (re)presentation, the exact same model can be presented in a variety of ways -- a graphical diagram, a formal linear notation (for machine processing; I take it this is what you call "logical"). But I would hope and expect that the underlying semantics would be exactly the same. We build a data model, and it should not matter how it is expressed, as long as the expressions are equivalent. The semantics relate to the model, not to how it is expressed (Conceptual?). Semantics is the meaning behind what is presented. Furthermore, all models are logical, that is, built according to some set of rules of logic and formal argument, no matter how a model is presented or who reads it (check your dictionary). The rules in this case constitute what I call the modeling scheme.
Intro
I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students; some are my responses to posts on the LinkedIn forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to geverest@umn.edu, since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to do that.
2020-03-29
Picking the right users for a data modeling project
Ken Evans said (LinkedIn, Data Modeling, 2020/03/29)
The piece of the puzzle that you have not mentioned is where the users' understanding of their domain is rather vague.
Everest responds:
If you have users with a vague understanding of their domain, you are talking to the wrong people. I have found that there are people in the user domain who really do know what is going on, what their world looks like. They are often seen as troublemakers, asking the tough questions, complaining about how things are done (or not done), and suggesting how things could be done better. They are the people on the front lines, working in the trenches, who actually think about what they are doing, are not satisfied with the status quo, and go beyond their job description. Every organization has such people; you just need to find them. And the best way to find them is to ask other users. Most can readily tell you who they are. If there is no one they can point to, you have a dead or dying organization where nobody cares. The people who fit the profile above will generally not be management or senior level -- they are usually tending to managing and training people, as they should be. Once you get the right people to the table, it takes a skilled facilitator to elicit the needed information and document it in a usable form, i.e., a data model, such that the users understand and concur with the representation.
Who judges the accuracy of a data model?
Kevin Feeney says (LinkedIn, Data Modeling, 2020/03/29)
How do we know that we have a correct model? My general take would be that we know the model is correct, if the world moves and the model moves with it without breaking anything - if the model is wrong, you'll find duplication and other errors seep in. In terms of how you set the scope for the model and how you deal with systems integration and different views on data - generally you need to go upwards in abstraction and make the model correct at the top level. For example, when you find that two departments have different concepts of what a customer is, both of which are correct from their points of view, there are implicit subclasses (special types of customer) that can be made explicit to make everything correct.
Everest responds:
Kevin, I see two main problems with your viewpoint. (1) It sounds like you are saying design it, build it, and wait for problems to arise. Surely we need to judge correctness before we commit resources to implementation; not doing so would be irresponsible and dangerous. Before implementing the model and building a database, we need some assurance that our model is an accurate representation of the user domain. (2) It sounds like you are depending on the designers/modelers to make judgments about model correctness. That is the last thing I would do. Too often I have found that the data modeling experts had only a superficial understanding of the user domain. They may be well versed in the modeling task, but that alone does not produce a good model. The best modeling tool in the world and the best modeling methodology would be insufficient to produce a "correct" data model. Rather than a "correct" model, I prefer to speak of striving for an "accurate" data model, that is, one that accurately represents the user domain. As Simsion argued, and I agree, there is no single correct model. So, who best to judge?
So, who best to judge the "correctness" of a data model? I say, the USERS THEMSELVES. They are the ones who understand their world better than anyone else. But you have to get the right users to the table. I have led dozens of data modeling projects, and we only go to implementation when ALL the user representatives sign off and say "Yes, this is an accurate representation of our world." If there are differences, they must be resolved among the users themselves (with wise direction from a trained data modeler). One caveat: the users must thoroughly understand the data model, in all its glorious detail (not just at a high level). It is the responsibility of the data modeler to ensure the users collectively understand all the details of the model -- an awesome responsibility. That means the users must understand the model diagrams and all the supporting documentation, particularly the definitions of the "things" (entities, objects), the relationships (binary, ternary, and more), and all the associated constraints (e.g., cardinalities). Our goal is to develop as rich a representation as possible of the semantics of the user domain, and that means having a rich set of constructs to use in developing the model. So far, I see ORM as the richest modeling scheme.
The best way to make this happen is for the user representatives to be part of the modeling team. In fact, they should be the ones in control. Upper management needs to grant release time to those users most knowledgeable about their domain. An experienced data modeler needs to facilitate and guide the modeling process and the development of the data model. The team needs to be allowed to meet and deliberate as long as necessary to arrive at a model which they all feel comfortable approving. In my experience the users have always known when they were done (and ready to go to implementation), although the time it took was difficult to predict up front. Only in one project were we unable to come to agreement and that is because we had the wrong user representatives at the table. They were little more than data entry clerks who really didn't understand the meaning of the data, why it was important, nor how it was used.
2020-03-26
Determining the "Correctness" of a data model
LinkedIn, Data modeling, 2020/3/26
I asked: How do we know when a data model is correct?
Ken Evans responded: That's easy, when the model conforms to stated requirements.
I then asked "who determines/documents the requirements?"
Ken responded:
It does not matter "who" determines the requirements. The point is that you can only judge whether a deliverable is "correct" if you have a set of pre-established requirements against which to assess "correctness". This principle has been widely accepted in the quality management discipline since 1978 when Philip Crosby published his book "Quality is Free." Crosby makes the point that "Quality is conformance to requirements."
Everest Responds:
Sorry about the "who", perhaps I should have said "how." While I accept the general principle, is it realistic? Crosby's statement begs the question, if quality is conformance to requirements, then who/how do we determine the correctness and completeness of the stated requirements? I have yet to see anything close to an a priori statement of requirements that was sufficient to judge the correctness of the end result, i.e., a domain data model. Furthermore, I have yet to see any guidelines sufficient for preparing a statement of requirements for a data modeling project. I would love to see any examples.
.. To me, the only satisfactory "statement of requirements" sufficient to judge the correctness of the model would be the final, detailed data model itself. Anything less would not be sufficient to express the full set of semantics to be included in the final model. In the case of ORM, perhaps it would be a complete set of elementary fact sentences, with well-defined object types and predicates. But that is what the entire process of data modeling is all about -- to discover and document the semantics of some user domain. We want to capture as rich a set of semantics as possible given our modeling scheme (which is why we need to use a modeling scheme such as ORM, which captures many more semantics than any other scheme, including ER/relational).
.. So the question remains, whether we are talking about the data model, or the requirements for a data model -- how do we judge the correctness of a data model? Who is in the best position to do that?
2020-03-25
Is data modeling Design or Description?
Ken Evans asks (LinkedIn, Data Modeling, 2020/3/25)
In the book "Data Modeling: Theory and
Practice" Graeme Simsion describes the result of his extensive research into the "Design or Description?" question. Graeme's research showed that whilst many people believe that a data modeler's job is to "describe" a reality that is out there, the truth is that data modelers are designers rather than describers.
Everest Responds:
I do not intend to contradict what Simsion said. If modeling were purely a descriptive activity it would be easy, although we would still have differing points of view and different interpretations. It is the differences in these descriptions that make it best considered a "design" activity -- precisely because there are many choices to be made in building the model. The modeler CHOOSES how to represent that domain in building the model.
So the question remains, how do we know when we have a correct model? It is interesting that Simsion maintains there is no one correct model, hence a design activity. I agree with that, which is why I questioned whether the goal of data modeling is to find THE correct model.
… Or perhaps one would be open to finding A correct model! However, that still doesn't answer the question: how do we know it is A correct model?
Is a data model a "representation"? ...of what?
Ken Evans asks (LinkedIn, Data modeling, 2020/3/25)
Hmm. Is a data model a "representation"?
EVEREST RESPONDS:
First of all, I don't like the phrase "Data Model" (sorry "Ted" Codd). It suggests that it is a "model of data." That is misleading to someone outside of our community. It is only a model of data if we have some data; then the model would be an (abstract) "representation" of the data. For us, a "data" model is a model of some aspects of a domain of interest to a community of users, real (world) or imagined/desired/yet to be built. It is a model "in data," that is, built using informational constructs, all guided by a modeling scheme. The modeling scheme tells us what to look for in the domain and how to represent it in the model. So we identify a population of similar things, give it a label and a definition, and put a box or circle into a diagram to represent that population of things. We build up or "design" a model with lots of types of things, add relationships among those things, and constraints on those things and relationships. The modeling scheme tells us how to represent those relationships and constraints in our model. Graham Witt adds some light to this argument by calling it a "business data model."
2018-05-15
Data is like the corn flakes
In our information systems and IT shops, computers, communication networks, programs, systems, programmers -- indeed, all IT/IS personnel -- are merely "factors of production." We don't "eat" the factors of production of corn flakes: the bins which store the corn and other ingredients, the machines which roll out the corn, mix the ingredients, and fill the boxes, the belts and pipes that move the product, or even the warehouse. No, we eat the corn flakes, just as we "eat" the data. It is the output from our computer information systems which is of value.
2018-05-14
Data must be managed as an asset
by Martha Lemoine, on the Data-centric manifesto.
Everest responds:
Yes, we often say this, but what does it really mean? Look at the definition of an asset in accounting. Organizations spend money; we call these expenditures. The general assumption is that an expenditure benefits the current accounting period (year), because it is basically helping to generate revenue for the same period in which it is spent, so we call it an expense. However, if it is a relatively large expenditure which benefits future years, say a new building or piece of machinery, then accountants want to capitalize the expenditure and spread it as an expense (depreciation) over multiple years, offsetting the future revenue it helps to generate. For example, a $500,000 machine expected to serve for five years might be expensed at $100,000 per year rather than all in the year of purchase. A capitalized expenditure is called an asset. Capitalizing an expenditure pushes an expense into the future, thus increasing reported income for the current accounting period or year.
.. Now in the case of data, does it benefit future time periods, i.e., help to generate revenue in the future? Of course. So what about the expenditures made to collect, store, process, retrieve, maintain, and update the data? Should they be capitalized? That is the hard part. Nobody does, except maybe a firm whose business is selling data, and even then most of the time they don't. So we must be careful calling data an "asset." Nevertheless, we can certainly call it "something of value" which, as you say, must be managed as an asset.
2018-03-12
Data Modeling and NoSQL
To the question "How relevant is data modeling in the world of NoSQL?" I give the following answer.
The main purpose of data modeling is to understand the business, some application domain, some user world. The model becomes a representation of that world -- the "things" in it, the relationships among those things, and any constraints on those things or relationships. A secondary purpose is to build a database to contain information which pertains to and describes that domain. Hence, the business data model should not be concerned with issues of physical stored representation, or with the transformations/manipulations/constraints which are imposed to facilitate implementation in some data (storage) management system. That could be a relational DBMS or a NoSQL tool. Generally we speak of the model coming first, then the implementation, and finally the data getting collected and stored according to the model. However, increasingly the data already exists in some form, which leaves us with the task of figuring out what it means, what it represents -- that is, understanding the data as it represents some user domain. NoSQL tools are often designed to deal with existing data and to process it more efficiently (that may be an oversimplification!). Either way, you must understand the business in order to make sense of the data.
BOTTOM LINE: it is vitally important to understand the principles, methods, and approaches of data modeling, which can help you get to an understanding of the data and the business domain it represents. Whether using traditional database management tools or NoSQL tools, you still need to understand the data.
2018-01-02
Data modeling course in a nutshell
A former student asks:
.. I took your class last spring and am currently in talks with some companies for a junior database position. I find it a little challenging when it comes to talking about exactly what I learned in your class, as we covered a lot of valuable information. If you have any input for a quick way to describe all of the information I learned, that would be very helpful.
Concerning Data Model Madness: what are we talking about?
Martijn posts (2012/10/11)
.. There have been endless debates on how to name, identify, relate, and transform the various kinds of data models we can, should, would, or must have in the process of designing, developing, and managing information systems (like data warehouses). We talk about conceptual, logical, and physical data models, usually in the context of certain tools, platforms, frameworks, or methodologies. A confusion of tongues is usually the end result. Recently David Hay made an interesting video (Kinds of Data Models -- And How to Name Them) in which he tries to resolve this issue in a consistent and complete manner. But on LinkedIn it was already questioned whether this is a final or universal way of looking at such models.
.. I note that if you use, e.g., FCO-IM diagrams you can go directly from conceptual to "physical" DBMS schema if you want to, even special ones like Data Vault or Dimensional Modeling. I also want to remark that there are 'formal language' modeling techniques like Gellish that defy David's classification scheme. They are both ontological and fact driven, and could in theory go from ontological to physical in one step without a conceptual or logical layer (while still being consistent and complete, by the way, so no special assumptions except a transformation strategy). The question arises how many data model layers we want or need, and what each layer should solve for us. There is tension between minimizing the number of layers while at the same time not overloading a layer/model with too many semantics and constructs, which hampers its usability.
.. For me this is governed by the concept of concerns, pieces of interest that we try to describe with a specific kind of data model. These can be linguistic concerns like verbalization and translation, semantic concerns like identification, definition, and ontology, data quality constraints like uniqueness, or implementation and optimization concerns like physical model transformation (e.g., Data Vault, Dimensional Modeling), partitioning, and indexing. Modeling/diagramming techniques and methods usually have primary concerns that they want to solve in a consistent way, and secondary concerns that they can model but do not deem important (although those are usually important concerns at another level/layer!). What makes this even more difficult is that within certain kinds of data models there is also a tension between notation, approach, and theory (the N.A.T. principle). E.g., the relational model is theoretically sound, but the formal notation of nested sets isn't visually helpful; ER diagramming looks good, but there is little theoretical foundation beneath it.
.. I personally think we should try to rationalize the use of data model layers, driven by concerns, instead of standardizing on a basic three-level approach of conceptual, logical, and physical. We should be explicit about the concerns we want to tackle in each layer instead of using generic, suggestive layer names.
I would propose the following (data) model layer minimization rule: a layered (data) modeling scenario supports the concept of separation of concerns (as defined by Dijkstra) in a suitable way with a minimum of layers, using a minimum of modeling methodologies and notations.
Physical data model governed by business (logical) data model
Jaunine Conradie (student) asks: (email to me on 2012 Aug 13)
The typical Teradata architecture includes a staging layer (which stages data from sources in the form in which it exists on the source), an integration layer (3NF integrated data from stage), and a semantic layer (a dimensional view on the integration layer data).
.. We are two modelers on the project. The problem that we have is that the ETL lead is demanding a model change to facilitate data loads and queries.
.. In short, a Subscriber on the network (Vodacom here in South Africa, which is like Sprint or Verizon in the USA) has various Access Methods of various Access Method Types. Types can, for example, be Primary (cell phone number), Data (for internet access), etc. The change being requested by the ETL lead is to host the Primary access method on Subscriber.
This is problematic because:
1. It is inconsistent with the implementation, where some access methods for a subscriber will now be carried on the Access Method Subscriber entity and some on the Subscriber entity itself.
2. We do not (and should not) carry history on the Subscriber entity.
3. Teradata has specialized index types to facilitate joins (thus negating the argument for denormalizing because of query performance).
How do we convince the program manager not to let ETL-related considerations dictate the design of the physical model?
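A minimal DDL sketch may make the trade-off concrete (all table and column names below are invented for illustration, not taken from the actual Vodacom model):

    -- Design A: every access method, including Primary, stays on the
    -- access-method entity, where history can be carried.
    CREATE TABLE Subscriber (
        subscriber_id   INTEGER      NOT NULL PRIMARY KEY,
        subscriber_name VARCHAR(100)
    );
    CREATE TABLE Subscriber_Access_Method (
        subscriber_id      INTEGER     NOT NULL REFERENCES Subscriber(subscriber_id),
        access_method_type VARCHAR(20) NOT NULL,   -- e.g. 'PRIMARY', 'DATA'
        access_method_no   VARCHAR(30) NOT NULL,
        effective_date     DATE        NOT NULL,
        PRIMARY KEY (subscriber_id, access_method_type, effective_date)
    );

    -- Design B (the ETL lead's request): the Primary access method is copied
    -- onto Subscriber itself, splitting access methods across two entities
    -- and leaving no place to carry history for the Primary one.
    CREATE TABLE Subscriber_Denormalized (
        subscriber_id     INTEGER      NOT NULL PRIMARY KEY,
        subscriber_name   VARCHAR(100),
        primary_access_no VARCHAR(30)  -- duplicated fact, no history
    );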
Definitions of concepts should start with their supertype. Why not?
Andries van Renssen posts on LinkedIn:
A good definition of a concept should build on the definitions of more generalized concepts. This can be done by two parts of a definition:
1. Specifying that the concept is a subtype of some other, more generalized concept.
2. Specifying in what respect the concept deviates from other subtypes of the same supertype concept.
Such definitions have great advantages:
- The definition of (and the knowledge about) the supertype concept does not need to be repeated, as it is also applicable ("by inheritance") to the defined (subtype) concept.
- The definition implies a taxonomy structure that helps to find related concepts that are also subtypes of the same supertype (the 'sister concepts'). Comparison with the 'sister concepts' greatly helps to refine the definitions. Furthermore, the definitions prepare for creating an explicit taxonomy.
- Finding the criteria that distinguish the concept from its 'sister concepts' prepares for explicit modeling of the definitions, which in turn prepares for computerized interpretation and the growth towards a common Formal Language.
-- Thus, why not?
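As a rough illustration (the names below are invented, not from Andries' post), the same genus-plus-differentia pattern carries over to a relational subtype design, where a subtype table repeats nothing from its supertype and records only what distinguishes it:

    -- Supertype: everything true of any Party is recorded once, here.
    CREATE TABLE Party (
        party_id   INTEGER      NOT NULL PRIMARY KEY,
        party_name VARCHAR(100) NOT NULL
    );

    -- Subtype: "a Person is a Party that has a date of birth."
    -- Only the distinguishing characteristic is added; the rest is
    -- "inherited" by joining to the supertype.
    CREATE TABLE Person (
        party_id   INTEGER NOT NULL PRIMARY KEY REFERENCES Party(party_id),
        birth_date DATE    NOT NULL
    );

    -- Sister subtype: "an Organization is a Party that has a registration number."
    CREATE TABLE Organization (
        party_id        INTEGER     NOT NULL PRIMARY KEY REFERENCES Party(party_id),
        registration_no VARCHAR(30) NOT NULL
    );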
Can we argue more about business rules?
Posted by Kamran Ayub - Thursday, May 6, 2010, 06:24 PM
I think it's interesting; it's too bad we waited so long to argue about it. Let me say first off: I understand both arguments. I have my own opinions, but I can definitely see both sides to this. First, I think we should establish some FACTS that we can all agree on:
1. No DBMS currently available (at least for production-oriented apps) can completely express every business rule we need to express innately. (Src: experience and Everest)
2. No matter which way you slice it, business rules eventually get turned into code somewhere, whether in the DBMS or in application logic. At some point, there's a piece of machine language where the rule is expressed and validated in 1s and 0s.
There, now that we agree on those two points, what exactly are we arguing about? If you believe Fact 1, then it follows that it is not possible to ever express all our business rules in the DBMS (at this current state in time).
Here's my view:
All DBMSs do enforce a limited set of business rules. Thus, if you argue that your application code holds all business rules in one central place, you are wrong, because the DBMS enforces a few of those constraints (mandatoriness/uniqueness/etc.). Most of us end up duplicating these rules to avoid catching exceptions, without thinking about it. Secondly, you cannot guarantee that there is only a single point of entry to your data. Don't argue with me; there is a sysadmin account somewhere on the system where the data is stored, and at least that account has full authority to change the data. How many of us go directly into our SQL databases to fix a data error, input sample data, or test triggers? Exactly. Besides, let's not forget you had to define that structure somehow! It may be true your business users can't access that data directly, but somebody is the data steward and she has full control, rules or no rules. By the way, what about unauthorized access? Whoops, guess that hacker didn't care about all your application-enforced business rules!
It's fine if you argue that the hacker could completely destroy your system, but you must agree that if your logic is stored in the application, at some point, due to the nature of TCP/IP, your data is out in the open between your app and your database. That could potentially be tampered with, and your application can't do anything about it.

So now, the argument is not whether it's better to store business rules in application logic; the question is why can't we put those rules in the DBMS itself, innately? The answer to that is simply because no DBMS allows that just yet. The reason I think people argue for storing rules in the application is simply because there's really no better alternative right now. If a .NET developer is comfortable with .NET and not with T-SQL, where do you think she'll put her business rules? In the .NET application or some other application gateway (not the DBMS). This is what I heard today in our discussion. We put all our rules in application logic because it's the best central repository we have... when really the best place should be the DBMS. After all, what have we been learning all along in this class? ORM is able to capture many of these rules... and yet not all. Business rules are there to establish the integrity of our data, so they should be in the DBMS.
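A minimal sketch of what that limited set looks like in generic SQL today (invented names; the point, per Fact 1, is that some rules declare cleanly while others have no declarative home):

    CREATE TABLE Customer (
        customer_id INTEGER NOT NULL PRIMARY KEY
    );
    CREATE TABLE Orders (
        order_id    INTEGER       NOT NULL PRIMARY KEY,               -- uniqueness
        customer_id INTEGER       NOT NULL                            -- mandatoriness
                    REFERENCES Customer(customer_id),                 -- referential integrity
        order_total DECIMAL(10,2) NOT NULL CHECK (order_total >= 0),  -- simple value rule
        order_date  DATE          NOT NULL
    );
    -- A rule like "a customer may have at most three open orders at a time"
    -- has no declarative form in most SQL DBMSs; it ends up in a trigger,
    -- a stored procedure, or (most often) duplicated application code.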
I believe it's useless to argue for or against storing rules in application logic, because right now it's impossible to put them all in the DBMS. Even if you wrote stored procs or triggers for everything, that's not enforced by the DBMS innately (triggers are closer to automatic), plus the rules are still in code. Furthermore, I believe most people put their rules in application code because it's just harder to do it in the DBMS, and because many enterprise or even regular applications can target different storage systems (one application could use XML, MSSQL, MySQL, DB2, or MS Access as repositories). Imagine duplicating all your business rules across those systems... it's much easier to store the rules in the code, isn't it? Let me be clear: I am not advocating that it is right to store business rules in app logic. I also believe that it is best to put business rules in the DBMS. All I am saying is, what choice do we have at the moment? Until a better DBMS is created, preferably using ORM as the tool to create the DB, we're SOL.
Let the discussion begin!

EVEREST RESPONDS:
.. I love it. I agree with all you have said. I just would not characterize it as "two sides of an argument." Both "sides" must spring from an environment of understanding of the basic concepts. There are two main traps we fall into:
1. Believing that someday we can reach the ideal state of a perfect world. Your discourse is filled with those thoughts, e.g., "no DBMS currently available..." (implying that one may be in the future), "at this current state in time" (suggesting that there may come a time), "no DBMS allows that just yet" (but perhaps a future DBMS product might?), "right now it is impossible" (with the implication that in the future it will be), "until a better DBMS is created" (better yes, but never perfect or complete). "Man" (which includes woman) is imperfect, and therefore so are his constructions -- our DBMSs, our databases, our code, our systems. So try as we might to improve (and we should), we must always be alert to imperfection, weaknesses, failures, and attacks (whether inadvertent or deliberate), and build in countermeasures. That is why we back up our data, build in redundancy, and why every government (every locus of power) must have "her majesty's loyal opposition" and the watchful eye of citizens and the media.
2. We tend to use, and to favor, that which is more familiar to us. If we are a programmer and that is what we know how to do, we will get things done by writing code in our favorite programming language. If we know how to use the full functionality of some DBMS, then that is what we will do as much as we can. Some application coders need to be stretched to consider an alternative way of doing things that might get us closer to the ideal of serving people (management, organizations) AND incorporating appropriate safeguards (which also will never be perfect).
.. The business rules and integrity constraints are the mechanisms that can lead to higher quality data, greater user confidence (in our databases, our systems, our IT departments, and our organizations), better decisions, and more effective and efficient operation. As you rightly state, ALL that happens in a computer system is done by executing code. But that is only half the story. ALL that executing code is driven by specifications, definitions, declarations, input parameters -- in other words, "data" more broadly interpreted. Even the program code is the specification of a process. The accuracy and completeness of the process depends on the specifications, the directives, etc.
.. As I show in my lecture on Business Rules (slides BRules.23 and .27), there is a hierarchy of levels where this information, the business rules, can be expressed:
1. DBMS
2. Stored procedures, whose execution is under the control of the DBMS
3. Code (methods) embedded (encapsulated) in "objects" which execute in a well-managed and controlled environment (OO, SOA, SaaS...)
4. User application code
5. External user actions (manual or sysadmin)
.. At all these levels the processes are driven by definitions and specifications. The efficacy of those processes depends on those specifications, and we (humans) understand those processes through documentation, which itself is usually quite imperfect.
.. Moving down these levels we encounter decreasing levels of control and increased dispersion. I put the DBMS at the top because it is the closest to the data and the most likely centralized point of control (least dispersion). Our objective is (or should be) to move these boundaries down so that work gets pushed up into more controlled environments. Another thing that happens as we move through these levels is increasing abstraction. When we declare database integrity constraints, that is a specification (abstraction) of some underlying process (executed by code... somewhere). We (humans) do not need to know what goes on under the covers; we merely need to have confidence that it will be done correctly and consistently, as reflected in the definition or specification. For example, we may understand the meaning of the square root operator, but how it gets executed on a computer is quite complex. I want to be hidden from that complexity. When we write a program we call all sorts of subroutines or functions, which are defined at a pretty high level (of abstraction).
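To illustrate (a sketch only; the table names are invented and the trigger follows a MySQL-like dialect, which varies by product): the same referential rule can be declared in one statement, leaving the enforcement process hidden, or hand-written as procedural code whose inner workings we then have to maintain and understand.

    -- Declarative: a specification; when and how it is checked is hidden from us.
    ALTER TABLE Employee
        ADD CONSTRAINT fk_employee_department
        FOREIGN KEY (dept_id) REFERENCES Department(dept_id);

    -- Procedural: one hand-coded piece of the process that the declaration abstracts.
    CREATE TRIGGER trg_employee_department
    BEFORE INSERT ON Employee
    FOR EACH ROW
    BEGIN
        IF NOT EXISTS (SELECT 1 FROM Department d WHERE d.dept_id = NEW.dept_id) THEN
            SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Unknown department';
        END IF;
    END;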
.. To the comment of "too bad we waited so
long": The focus of my course is data modeling. The subject of a
companion course would be DBMS and Database Administration. This issue we
are discussing would properly belong in the second course. However, I
raise the question here because the success of maintaining the integrity of our
databases according to our defined semantics depends upon the execution
environment, the DBMS ++.
.. Let the discussion continue.
2017-12-22
Natural or surrogate keys – selection criteria
In a post on the Data Modeling forum on LinkedIn (2012/05), Bryan had 2 criteria for picking a key, Tony added a third, and Kim had 4. Let me give you my nine, which I put forth in my advanced database design class at the University of Minnesota (see http://geverest.umn.edu):
CRITERIA: (conflicts necessitate compromise)
* Unique - guaranteed!
* Unlimited - won't ever run out of values; reuse?
* Unchanging, never changes - "immutable"
* Under your own control - you manage the values/codes, or at least some independent body exists to assign and maintain codes to ensure uniqueness. E.g., the Universal Postal Union (http://www.upu.int) maintains the official names of all countries in the world. New countries must apply for and have approved a name, or a change from an old name, or else they will be unable to exchange mail with other countries.
* Used by people in the user environment; if not, can you come up with one that satisfies the next criterion?
* 'Mnemonic' (easy for people to recognize/remember/generate)
* "Dataless" - sole purpose is to identify - should carry no other information; don't overwork an identifier
* Compact, easy and efficient to process and store - often means 'numeric' and 'fixed length' (secondary)
> else, invent a surrogate ("Autonumber") - only if none exists naturally
Anyone have a 'U' word for any of the last three???
These criteria will usually get you close. If you can't get close, then take the 'else.' It is important to distinguish the need for an identifier for users, as expressed in a business-oriented data model, from the model to be implemented in some DBMS. These two purposes are somewhat at odds with each other.
If you can't come up with a suitable natural key (according to the above criteria; some compromise is usually necessary), then use a surrogate key. Then some sort of lookup will be needed for the users to find the surrogate key based upon the best information they have when doing a search. For almost any sort of natural key, you can nearly always find examples which make things difficult: e.g., we run out of numbers; a state changes the name of a county (so any number ID assigned based upon alphabetical order becomes invalid); we change characteristics on which the ID was assigned (e.g., should color be in the unique ID for an appliance? It usually is, but I can paint my refrigerator -- color is an attribute, and should not be part of the ID).
One more comment. When we speak of having an identifier, it will only be unique within a defined and finite name space, that is, a particular context. If we go out of that name space we need to qualify the ID with an additional identifier of the expanded context, ad infinitum.
TABLE THINK - multiplicity (manyness) in a relationship?
Suppose we have a relational table - [ X | A | B | ... ] where A and B are attributes of X.
Two common confusions:
(1) First, we can say that there IS a relationship between X and A. Since we observe that for each X there is only one A (1NF), but for each A there could be many Xs, we may be led to say there is a one-to-many relationship between X and A. However, that relationship is actually many X to one A, thus many to one, not one to many, as stated. When you say one to many (between X and A), I must apply the terms respectively; otherwise, confusion results from the ambiguity and inconsistency. So the relationship is many to one between X and A, respectively.
(2) We observe that for each value of A there could be multiple different values of B in the table of X, and vice versa. Does that mean there is a many-to-many relationship between A and B? Any discussion of oneness or manyness must presume there is a relationship. When you say there is an M:N relationship between A and B, then I must be able to list all the valid pairwise associations, as in a table with two columns. And being M:N, the same value of A can appear in multiple rows and the same value of B can appear multiple times. Of course the exact same pair of A and B values could never appear on more than one row, since that would be the same instance of the relationship. Now for any given X, there is a value for A and a value for B. Would that particular value of A be the one that is related to that value of B? NO. Furthermore, could the same pair of A and B values appear on multiple rows in X? YES. So the juxtaposition of A and B in X does not represent a relationship between A and B. It is spurious. IF there is a relationship between A and B, it must be represented in another table, and could be 1:1, 1:M, M:1, or M:N. But since we said that we know everything about X, A, and B and it is all represented in this table, there must not be a relationship between A and B. Again, you can look at the table of X and see multiple different values of A appearing beside the same value of B, and vice versa. However, that in no way means there is a relationship between A and B represented in the table of X.
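A small sketch of point (2), with invented values: the side-by-side appearance of A and B in X establishes nothing; if A and B really are related, that relationship needs its own table, whose cardinality is declared independently of X.

    -- X with attributes A and B (each row: one X, one A value, one B value).
    CREATE TABLE X (
        x_id INTEGER NOT NULL PRIMARY KEY,
        a    CHAR(2) NOT NULL,   -- many X to one A
        b    CHAR(2) NOT NULL    -- many X to one B
    );
    INSERT INTO X VALUES (1, 'a1', 'b1'), (2, 'a1', 'b2'), (3, 'a2', 'b1'), (4, 'a1', 'b1');
    -- The same (a, b) pair repeats (rows 1 and 4), and a1 appears beside both b1 and b2;
    -- none of this says that a1 "is related to" b1 or b2.

    -- IF a genuine relationship between A and B exists, it gets its own table:
    CREATE TABLE A_B (
        a CHAR(2) NOT NULL,
        b CHAR(2) NOT NULL,
        PRIMARY KEY (a, b)       -- each valid pairing listed exactly once (here M:N)
    );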
Physical Data Model vs. Implemented Database Schema ‑ To what degree should they match?
By: Brent (Linkedin Discussion: Started 2011 Nov. 9)
. . We're having an internal discussion about PDM standards. The existing standard says that, for all tables, columns, keys, indexes, and relationships, the PDM should match what was implemented in the physical database. We have a huge Teradata database that has no primary key constraints, foreign key constraints, or indexes implemented in the database, but they are still in the data model. Those components are part of the application code used to load the tables, so the total database implementation = database schema + data load rules, and the standard is satisfied. Agree? Disagree? What's your standard?
EVEREST RESPONDS:
First recognize that a data model goes through several stages, from a detailed "conceptual" representation of the users' world to final implementation in some DBMS. This question suggests there are at least two stages. Ideally, as the data model is transformed from one stage to the next, the (user) semantics are retained (perhaps with differing syntactical representations), but additional features and constructs may be added. In general, the model at each stage will be different from the previous stage. So I would answer that they need not and should not match, except in the semantics of the users' world.
Note that user semantics does not include things such as clustering data items into records/tables, converting the representation of relationships into foreign keys, adding indexes or surrogate keys, enforcing first normal form (other normal forms are still relevant), etc. These are all introduced for the convenience of the system, to be implementable in some data modeling scheme (such as relational) and some target DBMS, and to reflect storage efficiency and performance goals. Note that none of these have anything to do with the users' view in a data model, and they should not be reflected in that first stage model (some call this the "conceptual" data model). Here the notion of conceptual should not be confused with a high-level or abstract presentation of a data model. While we may present a data model in a high-level abstraction, the underlying model needs to be as detailed as necessary to capture as much as possible of the exogenous semantics of the users' world. That model should have no consideration or constraints relating to physical implementation.
Back to the question. The implemented database schema (in some target DBMS) may have some additional elements defined in a way that is peculiar to that system, but they need not be expressed in the physical data model. While the physical data model is a prelude to implementation, it is primarily a vehicle for communication with people, to understand the data model. The physical data model can include generic elements of a physical representation, but not in the way they might be represented in the implemented schema.
Of course, it is important to document (for people) the transformations (mappings) made from one stage of a data model to the next, and the rationale for those choices. This documentation is key. From the discussion so far, people seem to be assuming that all you have are the two data models, and the issue is to what extent they are different or should be the same. If you want them to be the same, then I would argue for two physical data models -- one generic, and one tailored for the target DBMS implementation. However, this is more work and probably not very useful. So keep them different, and document the differences. The tool (DBMS or data modeling tool) should have some ability to produce a human-readable representation of the stored schema, even in a graphical form. We need both logical/physical models and "logical maps."
We should note here that data modeling schemes and DBMSs have their limitations (often chosen for efficiency reasons). Hence, we need to transform or fudge or introduce (sometimes spurious) constructs to allow certain semantics to be represented in the next stage model. For example, to transform an ER model to a relational model we must impose the "single atomic value" constraint on attributes (1NF), which means we must resolve all M:N relationships into two 1:M relationships with an intersection entity. Also, we must represent all relationships with foreign keys, which is another reason why we cannot represent M:N relationships directly in a relational model. None of this is necessary in the users' conceptual model. People don't have a problem understanding M:N relationships! Only a relational DBMS does! One approach to implementing complex user constraints which are not expressible in the target DBMS is to embody them in triggers and stored procedures, or worse yet, in application code. One example might be enforcing referential integrity, if that feature is not available in your target DBMS. Now the semantics really get buried and become difficult for users to comprehend.
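To make the M:N point concrete (generic SQL, invented names): the users' model simply says a Student may enroll in many Courses and a Course may have many Students; the relational implementation is forced to add an intersection entity and carry the relationship as foreign keys.

    CREATE TABLE Student (
        student_id INTEGER NOT NULL PRIMARY KEY
    );
    CREATE TABLE Course (
        course_id INTEGER NOT NULL PRIMARY KEY
    );
    -- Intersection entity introduced only because 1NF forbids multivalued attributes
    -- and a relational DBMS can represent relationships only as foreign keys.
    CREATE TABLE Enrollment (
        student_id INTEGER NOT NULL REFERENCES Student(student_id),
        course_id  INTEGER NOT NULL REFERENCES Course(course_id),
        PRIMARY KEY (student_id, course_id)
    );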
To Alan Howlett's point, if we try to (manually) maintain a PDM and an implemented schema which are intended to be the same, that is simply an invitation to introduce inconsistency. If they are to be the same then they must be stored as a single data model. Hopefully you have a tool which can generate user-friendly representations of the model, a graphical representation, or a schema for your DBMS (multiple?).

Logical vs. Physical data model
Jaunine Conradie (former student) asks:
. . Following the differences between logical and physical data models (e.g., include many-to-many relationships, thus excluding association entities except if attributes are associated with the association entity)...
Here's the question: if the task is to draw a relational model, do I draw a logical or a physical data model?
If the answer is a logical data model, then I expect not to see association entities, and thus also not the foreign keys that would have been included in the association entity.
If the answer is a physical data model, then I expect to see every attribute as it would be implemented in a specific RDBMS (thus including data types), based on a full analysis of all available attributes and their values.
Everest response:
First, I think we need to drop the term "logical" model. All data models are logical, and some have a physical representation by being implemented in some DBMS. The distinction between logical and physical models arose several decades ago in an effort to rise above the then-current data modeling schemes, which were just trying to draw pictures of a database as implemented and stored physically. Today we have all sorts of "logical" data modeling schemes.
. . What is important is that every data model is done with and conforms to some modeling scheme, whether logical or physical. So we can have logical models which are relational, network, hierarchical, single flat file, extended network (allowing M:N relationships), ER, extended ER (which includes ternary relationships and sub/supertypes), multidimensional,... and fact-oriented models like ORM.
. . Terry Halpin refers to relational models as "logical" and ORM models as "conceptual." OK, but don't get hung up on the terminology; it is the semantics that are important, not the syntax, i.e., the choice of words to describe semantic concepts.
. . If the answer is a logical model, then it will be drawn according to some modeling scheme, which may be relational, in which case it includes foreign keys and intersection entities for M:N relationships. This logical model could include attribute data types, although that information is getting closer to the physical. Even though we call ORM conceptual, we do specify data types for the objects!
. . If the answer is a physical model, we will generally include more information relating specifically to some DBMS, for example, conforming to the naming conventions and the available data types in the DBMS. I also suggest dropping the term "physical" model, preferring instead "implementation" model.
. . The truth of the matter is that any data model can be represented over a range of stages, from conceptual to physical, and at each stage it conforms to a particular modeling scheme.
On the origin of the fork notation
A student posts:
Dr. Everest, I just caught the fact that you developed the
"fork" notation.
Good thinking, it is the most clear and simple of all 1:M relationship
notations by far!
‑It occurred to me that maybe the profile of a badminton birdie was
your inspiration?
Everest response:
At the time when I did that (in a 1976 paper*, which became Chapter 4 in my Database Management book, McGraw-Hill, 1986, §4.3.2, p.132), my idea came from trying to resolve the difference between the then-dominant diagramming scheme of an arrow from parent to child (due to Charlie Bachman in a 1969 paper), and Nijssen drawing the arrow in the opposite direction to represent a function. What I did was move Nijssen's arrowhead to the other end of the arc and voila, a fork!
*Here is the reference (click on the title to get a copy of the paper): Gordon C. Everest, "Basic Data Structure Models Explained With A Common Example," Computing Systems 1976, Proceedings of the Fifth Texas Conference on Computing Systems, Austin, TX, 1976 October 18-19, pages 39-46 (Long Beach, CA: IEEE Computer Society Publications Office).
This was the original paper using a notation for relationship characteristics, and is thought to be the first to use what is now called a "fork" to represent multiplicity in a relationship. An updated version appeared as Chapter 4 in Database Management: Objectives, System Functions, and Administration, McGraw-Hill, 1986.
2017-12-20
Why number the rules of normalization? Is it helpful?
Chris Date/Ted Codd originally numbered the rules of normalization, so we have 1NF, 2NF, etc. They were defined in a cascading fashion: for example, for a data structure to be in 3NF, it must satisfy the 2NF rule, which must in turn satisfy the 1NF rule. This definition has persisted in virtually all introductory textbooks on data management and data modeling, evidently without the authors applying any critical thinking to the question. The authors just blindly accept what has been written before.
So is it possible for a relation (entity record or table) to have no transitive dependencies (thus satisfying the condition for 3NF) but have a partial dependency (violating 2NF)? Would that situation mean that the record/relation did not satisfy 3NF? Is it helpful to say that? Even more, when we speak of nested relations or object modeling (as in UML, for example), they explicitly violate 1NF by allowing nested repeating groups of data items or attributes in a record/relation. Does that mean that such structures automatically do not satisfy 2NF or 3NF, even if they have no partial dependencies or transitive dependencies?
The rules of normalization should be expressed or named by the condition which causes a violation.
1NF = NO multivalued attributes (repeating groups); a multivalued attribute must be pulled out into its own table, i.e., represented as a relationship.
2NF = NO partial dependency, i.e., no data item ("attribute") depends on only part of a composite key.
3NF = NO transitive dependency, i.e., no data item depends on the key only through an intermediate data item.
...
Then the rules of normalization are independent and can be applied in any order.
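A small sketch of why this matters (invented example): the table below has a partial dependency but no transitive dependency. Named by violation, we can say so directly; under the cascading numbering we can only say it fails 2NF and therefore cannot even be discussed in terms of 3NF.

    -- Key: (order_id, product_id). product_name depends on product_id alone,
    -- i.e., a partial dependency. There is no transitive dependency anywhere.
    CREATE TABLE Order_Line (
        order_id     INTEGER NOT NULL,
        product_id   INTEGER NOT NULL,
        product_name VARCHAR(50),    -- depends on only part of the key
        quantity     INTEGER,
        PRIMARY KEY (order_id, product_id)
    );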