Intro

I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students; some are my responses to posts on the LinkedIn forums. You are invited to comment on any post. People cannot post new topics on Google Blogspot unless they are listed as an author, so to create a new topic post or ask me a question, please send an email to geverest@umn.edu. Let me know if you would like me to list you as an author.

2018-01-02

Concerning Data Model Madness: what are we talking about?

Martijn posts (2012/10/11)

..   There have been endless debates on how to name, identify, relate and transform the various kinds of Data Models we can, should, would, must have in the process of designing, developing and managing information systems (like Data warehouses). We talk about conceptual, logical and physical data models, usually in the context of certain tools, platforms, frameworks or methodologies. A confusion of tongues is usually the end result. Recently David Hay made an interesting video (Kinds of Data Models ‑‑ And How to Name them) in which he tries to resolve this issue in a consistent and complete manner. But on LinkedIn it was already questioned whether this is a final or universal way of looking at such models.
..   One of the caveats is that a 'model' needs operators as well as data containers. Only the relational model (and some fact-oriented modeling techniques, which support conceptual queries) defines operators in a consistent manner. In this respect Entity Relationship techniques are formally only diagramming techniques. Disregarding this distinction for now, we need to ask ourselves whether there are universal rules for kinds of Data Models and their place within a model generation/transformation strategy.
..   I note that if you use e.g. FCO‑IM diagrams you can go directly from conceptual to "physical" DBMS schema if you want to, even special ones like Data Vault or Dimensional Modeling. I also want to remark that there are 'formal language' modeling techniques like Gellish that defy David's classification scheme. They are both ontological and fact driven and could in theory go from ontological to physical in one step without a conceptual or logical layer (while still being consistent and complete btw, so no special assumptions except a transformation strategy). The question arises how many data model layers we want or need, and what each layer should solve for us. There is tension between minimizing the number of layers and not overloading a layer/model with so many semantics and constructs that its usability suffers.
..   For me this is governed by the concept of concerns, pieces of interest that we try to describe with a specific kind of data model. These can be linguistic concerns like verbalization and translation, semantic concerns like identification, definition and ontology, data quality constraints like uniqueness, or implementation and optimization concerns like physical model transformation (e.g. Data Vault, Dimensional Modeling), partitioning and indexing. Modeling/diagramming techniques and methods usually have primary concerns that they want to solve in a consistent way, and secondary concerns that they can model but do not deem important (but that are usually important concerns at another level/layer!). What makes this even more difficult is that within certain kinds of data models there is also the tension between notation, approach and theory (N.A.T. principle). E.g. the relational model is theoretically sound, but the formal notation of nested sets isn't visually helpful. ER diagramming looks good but there is little theoretical foundation beneath it.
..   I personally think we should try to rationalize the use of data model layers, driven by concerns, instead of standardizing on a basic 3-level approach of conceptual, logical, and physical. We should be explicit about the concerns we want to tackle in each layer instead of using generic, suggestive layer names.
I would propose the following (data) model layers minimization rule:

A layered (data) modeling scenario supports the concept of separation of concerns (as defined by Dijkstra) in a suitable way with a minimum of layers using a minimum of modeling methodologies and notations.

1 comment:

  1. EVEREST RESPONDS:

    .. Martijn, I think you are on the right track. Rather than levels of data models we need to identify the elements, things, concepts, constructs, (or whatever you might call them) which we introduce into the models. For example, populations of things, names for populations of things, identifiers (lexical surrogates for members of a population of things), relationships (unary, binary, ternary...), characteristics/constraints of relationships (optional/mandatory, multiplicity/exclusivity), representation of relationships (e.g., foreign keys), attributes/properties of things, ring constraints, population constraints, etc. Ideally, we would like to define an ordering on these. Then we identify clusters of those elements to establish levels. For example, in the first stages of FOM/ORM we do not need identifiers (reference modes) but rather we can speak only of populations of things. We do need to have names for populations of things (so we can talk about them). That provides the semantics of those populations. Note that we do not even need to introduce the notion of relationships until after introducing the notion of things/objects/entities. In a model, I can have objects without relationships, but I can't have relationships without objects. This sets a precedence ordering on introducing the elements of the model. If we don't have relationships initially, then we would have no need for foreign keys. In fact, in ER we don't even have/need the notion of foreign keys. Foreign keys are a particular method of representing relationships imposed by the relational data modeling scheme. Also, it is unnecessary to introduce the notion of single valued/atomic attributes in ER. This (first normal form) is a constraint applied by the relational model for the purposes of implementation, not for modeling the user domain for users. It is interesting to note that we often jump to thinking about attributes of things prematurely when we think of entity tables. A lot of modeling can be done before putting stuff into tables (witness FOM and ORM). The truth here is that an attribute is an object which plays a role in a relationship with some (other) object. So you can't even have an attribute until you have presumed a relationship. In fact, in the relational model such relationships must be functional dependencies (M:1, or 1:1), so we must have the notion of relationship characteristics/constraints before we can have the notion of attributes in tables. So why not model all the relationships first? Then we can say that an object has attributes by being related to other objects.
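The point that attributes follow from relationships can be illustrated with a small sketch. It is only an illustration under assumptions: the `Relationship` class, the `attributes_of` helper, and the Person/Date/Language/SSN examples are hypothetical names that do not come from the discussion. The sketch models relationships first and then treats the target of a functional (M:1 or 1:1) relationship as an attribute of the object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """A binary relationship between two object types (hypothetical structure)."""
    name: str
    subject: str       # object type playing the first role
    target: str        # object type playing the second role
    multiplicity: str  # "M:1", "1:1", "1:M" or "M:N", read as subject:target


def attributes_of(obj, relationships):
    """Return the object types that can serve as attributes of `obj`.

    Only functional relationships (M:1 or 1:1 from `obj`) qualify, because
    each member of `obj` is then related to at most one target.
    """
    return [
        r.target
        for r in relationships
        if r.subject == obj and r.multiplicity in ("M:1", "1:1")
    ]


# Illustrative relationships, modeled before any table exists.
rels = [
    Relationship("born_on", "Person", "Date", "M:1"),
    Relationship("speaks", "Person", "Language", "M:N"),
    Relationship("has_ssn", "Person", "SSN", "1:1"),
]

person_attributes = attributes_of("Person", rels)
```

In this sketch `speaks` is many-to-many, so Language does not become an attribute of Person; in a relational implementation it would need its own table.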
    .. What I call elements in/of the model, you are calling "concerns." Rather than beating our heads against the wall trying to define stages of data models/modeling, we should try to identify the basic elements of data models/modeling and then establish some sort of precedence ordering on those elements (though never to be completely linear).
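One way to picture such a precedence ordering is as a dependency graph whose topological layers become candidate levels. The sketch below is a minimal illustration, not a proposed standard: the element names and the dependencies in `PRESUPPOSES` are assumptions chosen to mirror the examples above, and `levels` is a hypothetical helper. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Assumed precedence: each element maps to the elements it presupposes.
PRESUPPOSES = {
    "population": set(),
    "population_name": {"population"},
    "identifier": {"population"},
    "relationship": {"population"},
    "relationship_constraint": {"relationship"},
    "attribute": {"relationship_constraint"},
    "foreign_key": {"relationship", "identifier"},
}


def levels(deps):
    """Group elements into clusters that can be introduced together.

    Each cluster holds the elements whose prerequisites all appear in
    earlier clusters, so the result is a partial order, not a strict line.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable display order
        result.append(ready)
        sorter.done(*ready)
    return result


for number, cluster in enumerate(levels(PRESUPPOSES), start=1):
    print(number, ", ".join(cluster))
```

Elements in the same cluster have no required order among themselves, which matches the remark that the ordering is never completely linear.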

Comments to any post are always welcome. I thrive on challenges and it will be more interesting for you.