Everest on Data Modeling: data modeler

Intro

I have collected Q&A topics since about 2010. I am adding them to this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students; some are my responses to posts on the LinkedIn forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to geverest@umn.edu, since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to list you as one.

2020-04-23

All language is referential: Members of a population vs. identifiers

John O'Gorman posts (LinkedIn 2020 April 22)

Since all language is referential I can declare membership of strings based on their usage in communication. I don't need attributes or properties to do so. For example, the string 'John Smith' looks to me to refer to a Person, so I make an ontological commitment to associate that string with that (Person) class. Doing so accomplishes a couple of things: 1. I can use that class anywhere I might want to reference members of that collection.

Everest responds:

John O'Gorman. "All language is referential" - love it and I agree absolutely. Let's remember that we are modeling/representing things in some user domain. We (the designers/modelers) are the ones who define the groupings into populations. That process may seem arbitrary, but the groupings are chosen by the designer based on their purposes and how they wish to view the world. Such groupings do not naturally occur. The world only consists of instances of things. The designer, of course, uses clues in deciding the rules of membership, and those rules are usually based on the observed characteristics of individual prospective members of a population and on what is of interest to us.
.. The task of the modeler is to design a model to (accurately) represent the user domain. The model is essentially an abstract representation, abstract because we use "tokens" to represent things and populations of things in that domain. The dilemma for us is that we must find some way to (uniquely) identify individuals (members of populations). You would not be comfortable being put into a data storage device and spinning around on the surface of a disk at 100 mph! So we need a token to serve as a surrogate for you. We call that an identifier. (Criteria for choosing an identifier are a topic for another discussion.) Its form is usually some string of characters.
.. The trap we fall into is thinking that a character string, a particular surrogate token, IS the person.
It is not the string "John Smith" you want to associate with the class (or population), it is the actual person. You need to have confidence that the string uniquely identifies or references the person. The operation of our systems and databases depends on it. This is why we need to have a careful definition of the population of things we use in developing our models.
.. Defining populations and the criteria for inclusion and exclusion (and choosing identifiers) are the critical and difficult tasks of a "data" modeler.
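
As a rough illustration of the distinction between a member of a population and its identifier, here is a minimal Python sketch. The names Person, Population, admits, enroll, and resolve, and the identifiers "P-001" and "P-002", are hypothetical and chosen only for this example. It assumes the modeler supplies the membership rule as a function and that identifiers must be unique within a population.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(eq=False)  # eq=False: two Person objects are equal only if they are the same object
class Person:
    """Stand-in for an actual individual in the user domain (hypothetical)."""
    name: str


@dataclass
class Population:
    """A grouping defined by the modeler, with an explicit membership rule."""
    name: str
    admits: Callable[[object], bool]  # inclusion criterion chosen by the modeler
    members: Dict[str, object] = field(default_factory=dict, init=False)

    def enroll(self, identifier: str, thing: object) -> None:
        """Admit a thing under an identifier, enforcing the rule and uniqueness."""
        if not self.admits(thing):
            raise ValueError(f"{thing!r} does not qualify for population {self.name}")
        if identifier in self.members:
            raise KeyError(f"identifier {identifier!r} already refers to a member")
        self.members[identifier] = thing

    def resolve(self, identifier: str) -> object:
        """Follow the surrogate token back to the member it stands for."""
        return self.members[identifier]


persons = Population("Person", admits=lambda thing: isinstance(thing, Person))

# Two different people who happen to share the same name string.
first = Person("John Smith")
second = Person("John Smith")
persons.enroll("P-001", first)
persons.enroll("P-002", second)
```

In this sketch the string "John Smith" cannot be enrolled, because it is not a Person, and it could not serve as an identifier either, because it does not single out one individual. The identifiers are the tokens the system stores; the Person objects stand for the people those tokens reference.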

2020-03-29

Who judges the accuracy of a data model?

Kevin Feeney says (LinkedIn, Data Modeling, 2020/03/29)
How do we know that we have a correct model? My general take would be that we know the model is correct, if the world moves and the model moves with it without breaking anything - if the model is wrong, you'll find duplication and other errors seep in. In terms of how you set the scope for the model and how you deal with systems integration and different views on data - generally you need to go upwards in abstraction and make the model correct at the top level. For example, when you find that two departments have different concepts of what a customer is, both of which are correct from their points of view, there are implicit subclasses (special types of customer) that can be made explicit to make everything correct.

Everest responds:

Kevin, I see two main problems with your viewpoint. (1) It sounds like you are saying design it, build it, and wait for problems to arise. Surely we need to judge correctness before we commit resources to implementation. To do otherwise would be irresponsible and dangerous. Before implementing the model and building a database, we need to have some assurance that our model is an accurate representation of the user domain. (2) It sounds like you are depending on the designers/modelers to make judgments about model correctness. That is the last thing I would do. Too often I have found that the data modeling experts had only a superficial understanding of the user domain. They may be well versed in the modeling task, but that doesn't produce a good model. The best modeling tool in the world and the best modeling methodology would be insufficient to produce a "correct" data model. Rather than a "correct" model, I prefer to speak of striving for an "accurate" data model, that is, one that accurately represents the user domain. As Simsion argued and I agree, there is no single correct model.

So, who best to judge the "correctness" of a data model? I say, the USERS THEMSELVES. They are the ones who understand their world better than anyone else. But you have to get the right users to the table. I have led dozens of data modeling projects and we only go to implementation when ALL the user representatives sign off and say "Yes, this is an accurate representation of our world." If there are differences, they must be resolved among themselves (with wise direction from a trained data modeler). One caveat: the users must thoroughly understand the data model, in all its glorious detail (not high-level). It is the responsibility of the data modeler to ensure the users collectively understand all the details of the model -- an awesome responsibility. That means the users must understand the model diagrams and all the supporting documentation, particularly the definition of the "things" (entities, objects), relationships (binary, ternary, and more), and all the associated constraints (e.g., cardinalities). Our goal is to develop as rich a representation as possible of the semantics of the user domain, and that means having a rich set of constructs to use in developing the model. So far, I see ORM as the richest modeling scheme.

The best way to make this happen is for the user representatives to be part of the modeling team. In fact, they should be the ones in control. Upper management needs to grant release time to those users most knowledgeable about their domain. An experienced data modeler needs to facilitate and guide the modeling process and the development of the data model. The team needs to be allowed to meet and deliberate as long as necessary to arrive at a model which they all feel comfortable approving. In my experience the users have always known when they were done (and ready to go to implementation), although the time it took was difficult to predict up front. In only one project were we unable to come to agreement, and that was because we had the wrong user representatives at the table. They were little more than data entry clerks who really didn't understand the meaning of the data, why it was important, or how it was used.
