Everest on Data Modeling: ORM

Intro

I have collected Q&A topics since about 2010. They are being added to this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students; some are my responses to posts on the LinkedIn forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to geverest@umn.edu, since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to do that.


2020-09-13

Concept Modeling

Thomas Frisendal posted a nice piece on concept(ual) modeling: (2020 April 13)
www.dataversity.net/the-conceptual-model-strikes-back/#
In it he said that there were common ingredients to all conceptual modeling schemes:

Concepts
Properties
Relationships
The Triple (subject – predicate – object)
Occasional elements of semantic sugar such as cardinalities, class systems and various other abstractions.

I would like to suggest a few clarifications or modifications to this list as it might pertain to business "data" modeling, recognizing that we are modeling some aspect of a business/user domain (which may be represented in a stored database).

1. Note that "Concept" is a noun and each named concept represents a population of instances of some class of things. It is not just an abstract notion of some idea. Conceptual is an adjective. The key here for modeling is that Concepts need to be clearly defined so we know what instances are included and which are excluded from the population.

2. The notion of Property is derived (often called an "attribute" in data modeling). Here is my definition: An Attribute is a Concept (or Object) playing a Role in a Relationship with some (other) Concept. As an example, we can have the Concept of a Date, define the population of dates, and perhaps the format of its lexical representation. Now with the Concept of Employee, we could have the Concept of Birthdate (or HireDate or...). That would necessitate a Relationship between the two Concepts. In that Relationship, Birthdate is a Role name for Date. We could also have a predicate phrase to name the relationship, such as "is born on" (we could also have an inverse reading). The Relationship between Concepts comes first, before we can speak of Properties. A Property can exist all by itself, without having a relationship with another Concept, precisely because it is, first, a Concept. In fact-modeling circles, some people speak of an "attributeless" modeling scheme.

3. Recognize that Relationships can be more than binary. Nijssen's first paper (1976) was on Binary Modeling. It later became Fact modeling, recognizing that Facts could be unary, ternary, or more.

4. Saying that a common ingredient is the Triple restricts us to binary relationships. In Halpin's ORM a predicate can be of any arbitrary arity. Higher-order relationships are represented by objectifying a predicate. This is equivalent to having a separate table in a relational data model to represent even a many-to-many binary relationship. The need for that separate table stems from the first normal form restriction, which says that an "attribute" can be at most single valued. That means it is possible to directly represent at most a 1:Many relationship. Since this restriction is already a step toward implementation in a relational database, I argue that it has no place in a business concept "data" model. People don't have difficulty comprehending a M:N or even a ternary relationship, even if our relational data management systems do!

5. I call all of the occasional elements constraints, or more generally business rules. These include mandatory/optional and exclusive/multiple (what you call cardinality), and so much more, particularly conditional constraints (which are not possible even in the more advanced DBMSs). In fact, our overarching goal in concept modeling is to capture and formally express as rich a set of semantics as possible about the user domain. Lacking expressible semantics in our models, we are doomed to data quality issues. One of the best examples of rich data semantics capture is in Halpin's ORM flavor of fact modeling.
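As a rough illustration of points 3 and 4, here is a minimal Python sketch (not ORM notation). The `Fact` type, the sample facts, and the `role1`, `role2`, ... names are all hypothetical and chosen only for this example. It shows facts of arity 1, 2, and 3 held directly, and what happens when a ternary fact is forced into subject-predicate-object triples: an extra identifier for the fact itself has to be invented, loosely analogous to the extra table in a relational design.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One fact instance: a predicate reading plus the objects filling its roles."""
    predicate: str   # reading with {0}, {1}, ... placeholders, one per role
    objects: tuple   # one entry per role, so len(objects) is the arity


# Hypothetical unary, binary, and ternary facts.
facts = [
    Fact("{0} smokes", ("Pat",)),
    Fact("{0} was born on {1}", ("Pat", "1980-02-15")),
    Fact("{0} scored {1} in {2}", ("Pat", "A", "Data Modeling")),
]


def arity(fact: Fact) -> int:
    """Number of roles in the fact."""
    return len(fact.objects)


def verbalize(fact: Fact) -> str:
    """Render the fact as a sentence a business user could read."""
    return fact.predicate.format(*fact.objects)


def to_triples(fact: Fact, fact_id: str) -> list:
    """Restate an n-ary fact as (subject, predicate, object) triples.

    The subject of every triple is an invented identifier for the fact
    itself; that identifier is not part of the business domain.
    """
    return [(fact_id, f"role{i + 1}", obj) for i, obj in enumerate(fact.objects)]
```

Calling `verbalize` on the ternary fact gives one readable sentence, while `to_triples` on the same fact yields three triples that all hang off an artificial `fact_id`.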

2020-09-08

Modeling a Many-to-Many reflexive relationship

Response to John Sullivan post on LinkedIn, 2020 Sept 3.

His paper: https://www.linkedin.com/pulse/better-database-design-part-1-john-sullivan/

shows an entity labeled Organization Structure having two defined relationships with an Organization entity. The relationship arcs are labeled Parent Organization and Child Organization. In the descriptive text he states that the Organization Structure entity has two fields: Parent Org ID, and Child Org ID. Each field is a foreign key to its respective Organization entry, and together they form a composite key for the Organization Structure entity.

[Diagram: the Organization Structure entity connected to the Organization entity by two relationship arcs, labeled Parent Organization and Child Organization.]

Everest responds:

John, I hate to say it but you have fallen into the TABLE THINK trap, as has everyone else who tries to model Organizational Structure this way. You tell me how many business users will understand what you have called the "Organization Structure" entity in the diagram. You are right to say that such entities should be greyed out. In fact, they should not be there at all. They are there solely for the purpose of implementation (in a relational DBMS). How many times do we say, "the logical model should be independent," i.e., show nothing related to implementation or physical storage? The reality is that this representation is forced when you have to implement in 1NF relations (tables). Because all attributes must be at most single valued, a relational model can directly represent at most a 1:Many binary relationship; it cannot directly represent a many-to-many relationship. The proper logical diagrammatic representation would be an arc representing a relationship from the Organization entity to itself, with a fork (for manyness) at each end of the arc. The labels on the arc are role names for each Organization as it participates in the relationship.

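[Diagram: the Organization entity with a single arc looping back to itself, a fork at each end of the arc, one end labeled Parent and the other labeled Child.]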

Most people will have no difficulty understanding the notion of a many-to-many relationship, once they understand the meaning of the fork (already used in IE notation). Then we need to label the ends of the arc with the ROLE each instance plays in the relationship, either parent or child. We also need to apply some constraint declarations to this relationship. For example, we probably don't want to allow an organization to be both parent and child in the same relationship instance. We may also want to exclude a circular chain of parent-child relationships. As an aside, let me say that all of this is handled quite nicely and precisely in Halpin's Object Role Modeling (ORM) scheme, a variant of fact modeling, and the focus of my years of teaching advanced data modeling at the University of Minnesota.
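The two constraints mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relationship is held as a plain set of (parent, child) pairs, the function name `violations` and the organization names are hypothetical, and nothing here is ORM notation or Halpin's method. The first check rejects an organization that is its own parent; the second rejects circular chains of parent-child links.

```python
def violations(pairs):
    """Return constraint violations for a set of (parent, child) pairs."""
    problems = []

    # Constraint 1: no organization may be both parent and child
    # in the same relationship instance.
    for parent, child in sorted(pairs):
        if parent == child:
            problems.append(f"{parent} is its own parent")

    # Constraint 2: no circular chain of parent-child relationships.
    # Self-pairs are skipped here because they were reported above.
    children = {}
    for parent, child in pairs:
        if parent != child:
            children.setdefault(parent, set()).add(child)

    for start in sorted(children):
        stack, seen = list(children[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                # Following child links led back to where we began.
                problems.append(f"{start} is in a circular chain")
                break
            if node not in seen:
                seen.add(node)
                stack.extend(children.get(node, ()))
    return problems


# Hypothetical populations: Sales has two parents, so this is many-to-many.
ok_pairs = {("HQ", "Sales"), ("HQ", "Support"), ("Alliance", "Sales")}
bad_pairs = {("A", "B"), ("B", "C"), ("C", "A"), ("D", "D")}
```

Here `violations(ok_pairs)` returns an empty list, while `violations(bad_pairs)` reports the self-parent D and the circular chain through A, B, and C.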

2020-04-01

Thinking about Attributes or Properties

Kevin Feeney (LinkedIn, Data Modeling, 2020/03/24)
In his presentation on data modeling (https://lnkd.in/dnwYTEY), he says that an accurate data model defines Things, Properties of things, how things are Identified, and Relationships.

Everest says:
A caution when thinking about Properties. You cannot define an Attribute until you first have (or presume) a Relationship. An Attribute is a thing with a population (a domain of values). ORM therefore does not distinguish between the two; it calls them both Objects. For example, I could have a thing called a Skill Code, and Employees have Skills. That means there is a relationship between Employee and Skill. We often depict an Attribute being tucked away in a box for the Employee. This naturally leads to (thinking about) putting it in a column in a table for the Employee entity. That can lead to problems. In ORM we defer thinking about tables since that is really a step toward implementation (in a Relational DBMS). Better to think in terms of two objects, Employee and Skill, with a relationship between them. So here is the definition: An ATTRIBUTE (or Property) is an OBJECT which plays a ROLE in a RELATIONSHIP with another OBJECT. Now we can add cardinality to the relationship. In fact, in this example, if an Employee can possess multiple Skills there is a M:N relationship and Skill cannot be stored in an Employee table (it would violate First Normal Form). But Skill is no less an attribute of Employee, even if it is not stored in the Employee table. That further reinforces the fact that an OBJECT has ATTRIBUTES by virtue of having RELATIONSHIPS with other OBJECTS. Hence, there is no need for an Attribute artifact in a data model.
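The Employee and Skill example can be sketched in Python as two populations plus a set of fact instances, with no "attribute" construct at all. This is a minimal sketch under assumptions: the names `possesses`, `attribute_values`, and `is_many_to_many`, and the sample employees and skills, are hypothetical. Note that `is_many_to_many` only inspects the sample population; in a real model the cardinality would be a declared constraint.

```python
# Hypothetical populations of two object types.
employees = {"Ann", "Bob"}
skills = {"SQL", "ORM", "COBOL"}

# The relationship "Employee possesses Skill" as a set of fact instances.
possesses = {("Ann", "SQL"), ("Ann", "ORM"), ("Bob", "SQL")}


def attribute_values(fact_population, obj, role=0):
    """Values an object 'has' by virtue of a relationship.

    Returns the objects playing the other role in every fact
    where `obj` plays `role` (0 = first role, 1 = second role).
    """
    other = 1 - role
    return {fact[other] for fact in fact_population if fact[role] == obj}


def is_many_to_many(fact_population):
    """True if some object repeats in the first role and some in the second."""
    firsts = [fact[0] for fact in fact_population]
    seconds = [fact[1] for fact in fact_population]
    return len(set(firsts)) < len(firsts) and len(set(seconds)) < len(seconds)
```

Read one way, `attribute_values(possesses, "Ann")` gives Ann's skills; read the other way, `attribute_values(possesses, "SQL", role=1)` gives the employees who possess SQL. Skill is an attribute of Employee, and Employee of Skill, purely through the relationship.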

2020-03-29

Who judges the accuracy of a data model?

Kevin Feeney says (LinkedIn, Data Modeling, 2020/03/29)
How do we know that we have a correct model? My general take would be that we know the model is correct, if the world moves and the model moves with it without breaking anything - if the model is wrong, you'll find duplication and other errors seep in. In terms of how you set the scope for the model and how you deal with systems integration and different views on data - generally you need to go upwards in abstraction and make the model correct at the top level. For example, when you find that two departments have different concepts of what a customer is, both of which are correct from their points of view, there are implicit subclasses (special types of customer) that can be made explicit to make everything correct.

Everest responds:

Kevin, I see two main problems with your viewpoint. (1) It sounds like you are saying design it, build it, and wait for problems to arise. That would be irresponsible and dangerous. Surely we need to judge correctness before we commit resources to implementation. Before implementing the model and building a database, we need to have some assurance that our model is an accurate representation of the user domain. (2) It sounds like you are depending on the designers/modelers to make judgments about model correctness. That is the last thing I would do. Too often I have found that the data modeling experts had only a superficial understanding of the user domain. They may be well versed in the modeling task, but that doesn't produce a good model. The best modeling tool in the world and the best modeling methodology would be insufficient to produce a "correct" data model. Rather than a "correct" model, I prefer to speak of striving for an "accurate" data model, that is, one that accurately represents the user domain. As Simsion argued, and I agree, there is no single correct model.

So, who best to judge the "correctness" of a data model? I say, the USERS THEMSELVES. They are the ones who understand their world better than anyone else. But you have to get the right users to the table. I have led dozens of data modeling projects and we only go to implementation when ALL the user representatives sign off and say "Yes, this is an accurate representation of our world." If there are differences, they must be resolved among themselves (with wise direction from a trained data modeler). One caveat: the users must thoroughly understand the data model, in all its glorious detail (not high-level). It is the responsibility of the data modeler to ensure the users collectively understand all the details of the model -- an awesome responsibility. That means the users must understand the model diagrams and all the supporting documentation, particularly the definition of the "things" (entities, objects), relationships (binary, ternary, and more), and all the associated constraints (e.g., cardinalities). Our goal is to develop as rich a representation as possible of the semantics of the user domain, and that means having a rich set of constructs to use in developing the model. So far, I see ORM as the richest modeling scheme.

The best way to make this happen is for the user representatives to be part of the modeling team. In fact, they should be the ones in control. Upper management needs to grant release time to those users most knowledgeable about their domain. An experienced data modeler needs to facilitate and guide the modeling process and the development of the data model. The team needs to be allowed to meet and deliberate as long as necessary to arrive at a model which they all feel comfortable approving. In my experience the users have always known when they were done (and ready to go to implementation), although the time it took was difficult to predict up front. Only in one project were we unable to come to agreement and that is because we had the wrong user representatives at the table. They were little more than data entry clerks who really didn't understand the meaning of the data, why it was important, nor how it was used.

2018-01-02

Concerning Data Model Madness: what are we talking about?

Martjin posts (2012/10/11)

.. There have been endless debates on how to name, identify, relate and transform the various kinds of Data Models we can, should, would, must have in the process of designing, developing and managing information systems (like Data warehouses). We talk about conceptual, logical and physical data models, usually in the context of certain tools, platforms, frameworks or methodologies. A confusion of tongues is usually the end result. Recently David Hay has made an interesting video (Kinds of Data Models ‑‑ And How to Name them) in which he tries to resolve this issue in a consistent and complete manner. But on LinkedIn it has already been questioned whether this is a final or universal way of looking at such models.

.. One of the caveats is that a 'model' needs operators as well as data containers. Only the relational model (and some Fact oriented modeling techniques, which support conceptual queries) defines operators in a consistent manner. In this respect Entity Relationship techniques are formally diagramming techniques. Disregarding this distinction for now we need to ask ourselves if there are universal rules for kinds of Data Models and their place within a model generation/transformation strategy.

.. I note that if you use e.g. FCO‑IM diagrams you can go directly from conceptual to "physical" DBMS schema if you want to, even special ones like Data Vault or Dimensional Modeling. I also want to remark that there are 'formal language' modeling techniques like Gellish that defy David's classification scheme. They are both ontological and fact driven and could in theory go from ontological to physical in one step without a conceptual or logical layer (while still being consistent and complete btw, so no special assumptions except a transformation strategy). The question arises how many data model layers we want or need, and what each layer should solve for us. There is tension between minimizing the number of layers while at the same time not overloading a layer/model with too much semantics and constructs, which hampers its usability.

.. For me this is governed by the concept of concerns, pieces of interest that we try to describe with a specific kind of data model. These can be linguistic concerns like verbalization and translation, semantic concerns like identification, definition and ontology, data quality constraints like uniqueness, or implementation and optimization concerns like physical model transformation (e.g. Data Vault, Dimensional Modeling), partitioning and indexing. Modeling/diagramming techniques and methods usually have primary concerns that they want to solve in a consistent way, and secondary concerns they can model, but do not deem important (but that are usually important concerns at another level/layer!). What makes this even more difficult is that within certain kinds of data models there is also the tension between notation, approach and theory (N.A.T. principle). E.g. the relational model is theoretically sound, but the formal notation of nested sets isn't visually helpful. ER diagramming looks good but there is little theoretical foundation beneath it.

.. I personally think we should try to rationalize the use of data model layers, driven by concerns, instead of standardizing on a basic 3 level approach of conceptual, logical, and physical. We should be explicit on the concerns we want to tackle in each layer instead of using generic suggestive layer names.

I would propose the following (data) model layers minimization rule:

A layered (data) modeling scenario supports the concept of separation of concerns (as defined by Dijkstra) in a suitable way with a minimum of layers using a minimum of modeling methodologies and notations.
