Intro

I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students; some are my responses to posts on the LinkedIn forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to geverest@umn.edu, since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to do that.

2020-09-24

Graph Databases and Graph Theory - What is a Graph?

 

Victor Morgante, in "What is a graph database?" 2020 Sept 15

https://towardsdatascience.com/what-is-a-graph-database-249cd7fdf24d

is trying to get beyond the sales hype for graph databases by getting down to a base definition:

“A graph database is any database over which a graph schema ... can be written”

However, there are many different definitions and conditions for a graph schema which can account for the differences in graph databases. 

When we speak of graph databases we are invoking the notion of a graph.  At its base, a graph is simply a collection of nodes, representing things, and lines or arcs, each representing a relationship between nodes. If we add that an arc is always between two nodes, then we are restricted to binary relationships.

A critical assumption about graphs in graph theory, and one that is often overlooked or not stated, is that the nodes are homogeneous, at least in some sense, that is, they all belong to one population. The other assumption is that each node is uniquely identifiable and distinguished from all other nodes, that is, there is a name space over the population of nodes.
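The base definition and the two assumptions above can be made concrete in a few lines. This is a minimal sketch with hypothetical node names: a graph is just a uniquely named population of nodes plus a set of arcs between them.

```python
# A graph as a set of uniquely identified nodes plus a set of arcs.
# Node names here are hypothetical examples.
nodes = {"Person", "Dept", "Skill"}                 # a name space: each node distinct
arcs = {("Person", "Dept"), ("Person", "Skill")}    # each arc joins exactly two nodes

# Because an arc is always between two nodes, only binary relationships fit here.
arcs_are_well_formed = all(a in nodes and b in nodes for (a, b) in arcs)
```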

You have made it clear that the nodes in a graph you are speaking of are at the schema level, that is, every node represents a type or population of instances of things (or a domain of values or identifiers).

Beyond that, there are many rules or characteristics we may assume or impose on the formation of a graph:

- Connected – every node must be connected to at least one other node.

- Acyclic – there is at most one path between any two nodes.

- Rooted – there is a single "entry point" into the graph. (These three together define a tree structure.)

- Directed – every arc "points" in one direction (and never both directions).

- Binary only, or allowing a reflexive relationship – a node relating or pointing to itself.

- Ternary and higher-order relationships involving more than two nodes.  E.g., Person possesses Skill at level of Proficiency, or Person enrolled in Class at School earns Grade.

- Naming the arcs to reflect some semantics about the relationship.  In the data world, these would be verb phrases or predicates (while the node names are nouns).  Note, there is always an inverse reading, whether expressed or not.  E.g., Person works in Dept; Dept employs Person.

- Role Names – an alternate name for a node which reflects the role it plays in the relationship.  E.g., a Person working in a Dept could be called an Employee; a Person enrolled in a Class is a Student.

- Being at the type or schema level, with each node being a population of instances, it now becomes critically important to specify for each relationship (arc) its cardinality, that is, the mandatory/optional and exclusivity/multiplicity characteristics, and in both directions (in each binary relationship).
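A couple of the characteristics above can be sketched as checks over a node-and-arc representation. These are illustrative helper functions, not from any particular graph database; the acyclicity check removes source nodes repeatedly, in the style of a topological sort.

```python
def is_connected(nodes, arcs):
    """Connected (in the sense above): every node participates in at least one arc."""
    touched = {n for arc in arcs for n in arc}
    return nodes <= touched

def is_acyclic(nodes, arcs):
    """Directed-acyclic check: repeatedly remove nodes with no incoming arc."""
    remaining, edges = set(nodes), set(arcs)
    while remaining:
        sources = {n for n in remaining if not any(t == n for (_, t) in edges)}
        if not sources:          # every remaining node has an incoming arc -> a cycle
            return False
        remaining -= sources
        edges = {(f, t) for (f, t) in edges if f in remaining and t in remaining}
    return True
```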

In any given context we may go about drawing nodes and arcs without stating our assumptions about the nature or characteristics of the graph.  Perhaps it is differences in how these characteristics are manifest in particular graph databases that account for the different schemes.  To be sure, the differences are significant in our understanding of graph databases in different data modeling schemes and data management systems.  Now it makes a difference how you define your graph database.  It is overly simplistic to say that every relational database is a graph database.

 

2020-09-19

Definition of Arc in Graph Theory

 I have always used the term "arc" to refer to a line drawn between two nodes to represent a relationship between the nodes in a diagram.  I never meant it to be directional.  I apologize if people were confused.  In graph theory, an arc is defined as "an ordered pair."  That means it has direction -- the arc (A-B) is different from the arc (B-A), even if we do not include an arrowhead or some other notation.  How confusing!  There is some inconsistency in graph theory, since they use the term "directed" graph or digraph when the arcs all have a direction.  Then they always draw the arcs as an arrow... go figure.

I suppose I could be using "line" or "edge" which would be more correct in graph theory.

Well, I am not about to go back and revise everything I have written or every data model diagram I have ever drawn!  So be advised: when I say "arc" or draw one between two nodes, it never implies directionality. If direction is important I have always included some additional notation to indicate the direction, as for cardinality, a physical pointer, or a predicate reading.
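The distinction can be shown directly in code: an ordered pair (an arc, in graph-theory terms) carries direction, while an unordered pair (an edge) does not. A small sketch with hypothetical node names:

```python
# An arc is an ordered pair: (A, B) and (B, A) are different arcs.
arc_ab, arc_ba = ("A", "B"), ("B", "A")

# An undirected edge can be modeled as an unordered pair instead.
edge_ab, edge_ba = frozenset({"A", "B"}), frozenset({"B", "A"})

direction_matters_for_arcs = (arc_ab != arc_ba)
direction_matters_for_edges = (edge_ab != edge_ba)
```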

2020-09-13

Concept Modeling

Thomas Frisendal posted a nice piece on concept(ual) modeling: (2020 April 13)
www.dataversity.net/the-conceptual-model-strikes-back/#
In it he said that there were common ingredients to all conceptual modeling schemes:
  • Concepts
  • Properties
  • Relationships
  • The Triple (subject – predicate – object)
  • Occasional elements of semantic sugar such as cardinalities, class systems and various other abstractions.

I would like to suggest a few clarifications or modifications to this list as it might pertain to business "data" modeling, recognizing that we are modeling some aspect of a business/user domain (which may be represented in a stored database).

1. Note that "Concept" is a noun and each named concept represents a population of instances of some class of things.  It is not just an abstract notion of some idea. Conceptual is an adjective.  The key here for modeling is that Concepts need to be clearly defined so we know what instances are included and which are excluded from the population.

2. The notion of Property is derived (often called an "attribute" in data modeling).  Here is my definition:  An Attribute is a Concept (or Object) playing a Role in a Relationship with some (other) Concept.  As an example, we can have the Concept of a Date, define the population of dates, and perhaps the format of its lexical representation. Now with the Concept of Employee, we could have the Concept of Birthdate (or HireDate or...).  That would necessitate a Relationship between the two Concepts.  In that Relationship, Birthdate is a Role name for Date.  We could also have a predicate phrase to name the relationship, such as "is born on" (we could also have an inverse reading).  The Relationship between Concepts comes first, before we can speak of Properties.  A Property can exist all by itself without having a relationship with another Concept, precisely because it is, first, a Concept.  In fact-modeling circles, some people speak of an "attributeless" modeling scheme.

3. Recognize that Relationships can be more than binary. Nijssen's first paper (1976) was on Binary Modeling.  It later became Fact modeling, recognizing that Facts could be unary, ternary, or more.

4. Saying that a common ingredient is the Triple, restricts us to binary relationships.  In Halpin's ORM a predicate can be of any arbitrary arity.  Higher order relationships are represented by objectifying a predicate.  This is equivalent to having a separate table in a relational data model to represent even a many-to-many binary relationship.  This stems from the first normal form restriction which says that an "attribute" can be at most single valued.  That means it is possible to directly represent at most a 1:Many relationship.   Since this restriction is already a step toward implementation in a relational database, I argue that it has no place in a business concept "data" model.  People don't have difficulty comprehending a M:N or even a ternary relationship, even if our relational data management systems do!

5. For the occasional elements I call them all constraints, or more generally business rules.  This includes mandatory/optional and exclusive/multiple (what you call cardinality), and so much more, particularly conditional constraints (which are not possible even in the more advanced DBMSs).  In fact, our overarching goal in concept modeling is to capture and formally express as rich a set of semantics as possible about the user domain.  Lacking expressible semantics in our models, we are doomed to data quality issues.  One of the best examples of rich data semantics capture is in Halpin's ORM flavor of fact modeling.
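Point 4 can be illustrated with plain data structures: a triple accommodates only a binary fact, while a ternary fact such as "Person possesses Skill at level of Proficiency" carries three roles. The names below are illustrative only, not from any particular modeling tool.

```python
# A binary fact fits the subject-predicate-object triple shape.
triple = ("Person:Pat", "is born on", "Date:1990-01-01")

# A ternary fact has three roles and cannot be a single triple without
# objectifying the predicate (or splitting it into several triples).
ternary_fact = {
    "possessor": "Person:Pat",
    "skill": "Welding",
    "proficiency": "Expert",
}
```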

2020-09-08

Modeling a Many-to-Many reflexive relationship


Response to John Sullivan's post on LinkedIn, 2020 Sept 3.

His diagram shows an entity labeled Organization Structure having two defined relationships with an Organization entity.  The relationship arcs are labeled Parent Organization and Child Organization.  In the descriptive text he states that the Organization Structure entity has two fields: Parent Org ID and Child Org ID.  Each field is a foreign key to its respective Organization entry, and together they form a composite key for the Organization Structure entity.

[Diagram: Organization Structure entity with two one-to-many relationship arcs to the Organization entity, labeled Parent Organization and Child Organization]
Everest responds:

John, I hate to say it but you have fallen into the TABLE THINK trap, as has everyone else who tries to model Organizational Structure this way. You tell me how many business users will understand what you have called the "Organization Structure" entity in the diagram.  You are right to say they should be greyed out.  In fact, they should not be there at all.  They are there solely for the purpose of implementation (in a relational DBMS).  How many times do we say, "the logical model should be independent, i.e., show nothing related to implementation or physical storage"?  The reality is that this representation is forced when you have to implement in 1NF relations (tables).  A relational model cannot directly represent a many-to-many relationship – all attributes must be at most single valued – so it can directly represent at most a 1:Many binary relationship.  The proper logical diagrammatic representation would be an arc representing a relationship on the Organization entity to itself with a fork (for manyness) at each end of the arc.  The labels on the arc are role names for each Organization as it participates in the relationship.

[Diagram: a single reflexive many-to-many relationship arc on the Organization entity, with a fork at each end and the role names Parent and Child]

Most people will have no difficulty understanding the notion of a many-to-many relationship, once they understand the meaning of the fork (already used in IE notation).  Then we need to label the ends of the arc with the ROLE each instance plays in the relationship, either parent or child.  We also need to apply some constraint declarations to this relationship.  For example, we probably don't want to allow an organization to be both parent and child in the same relationship instance.  We may also want to exclude a circular chain of parent-child relationships.  As an aside, let me say that all of this is handled quite nicely and precisely in Halpin's Object Role Modeling (ORM) scheme, a variant of fact modeling, and the focus of my years of teaching advanced data modeling at the University of Minnesota.
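The reflexive many-to-many relationship and the two constraints mentioned can be sketched with the relationship held as a set of (parent, child) role pairs. Organization names and function names here are hypothetical, purely for illustration.

```python
# The reflexive M:N relationship as a set of (parent, child) role pairs.
reports = {("Acme", "Acme East"), ("Acme", "Acme West"), ("Acme East", "Retail")}

def violates_self_parent(pairs):
    """Constraint: no organization may be both parent and child in one instance."""
    return any(p == c for (p, c) in pairs)

def has_circular_chain(pairs):
    """Constraint: no circular chain of parent-child relationships."""
    parents = {}
    for p, c in pairs:
        parents.setdefault(c, set()).add(p)
    def reaches(start, target, seen=()):
        # Walk upward through parents; a chain that returns to target is circular.
        for p in parents.get(start, ()):
            if p == target or (p not in seen and reaches(p, target, seen + (p,))):
                return True
        return False
    return any(reaches(p, p) for p, _ in pairs)
```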

2020-04-25

Are we really modeling data?

Even the title of this blog is a little misleading.  Perhaps I should call it
Everest on "Data" Modeling!
I never did like the phrase "data" modeling.  It suggests that it is a "model of data."  That is misleading to someone outside of our community.  At the heart of it we are not modeling data, we are modeling some user domain. It is only a model of data if we have some data.  Then the model would be an (abstract) "representation" of the data.  For us, as data modelers, a "data" model is a model of some aspects of a domain of interest to a community of users, real (world) or imagined/desired/yet to be built.  It is a model expressed "in data," that is, built using informational constructs, all guided by a modeling scheme.  The modeling scheme tells us what to look for in the domain and how to represent it in the model.  So we identify a population of similar things, give it a label and a definition, and put a box or circle into a diagram to represent that population of things.  We build up or "design" a model with lots of types of things, add relationships among those things, and constraints on those things and relationships.  The modeling scheme tells us how to represent those relationships and constraints in our model.

2020-04-23

Is defining entity/object type populations arbitrary?

I have been saying (see my video lecture on Subtypes/Supertypes on YouTube) that defining populations is essentially "arbitrary", a choice made by the designer in modeling some user domain.  This is to contrast with a common view that there is one correct model and it is the good designer who can find it!

Fabian Pascal commented:  I would not refer to it as "arbitrary", but pragmatic -- to incorporate perceptions and serve application needs of users.

I dislike the word "arbitrary" too. Fabian is absolutely correct. Sometimes it may seem arbitrary to others, but it is actually a choice made by the designer(s) based on the purpose(s) of the model. Since the world presents itself to us as only made up of instances (individual things), the modeler must choose how to define the grouping of things into populations or classes. This is part of the abstraction process in modeling. As an example, one may be interested in animals and a cow would be an instance. But a dairy farmer probably wants to keep information on each individual cow. As another example, a designer may define a population of Employee, but that may include or exclude applicants, furloughed, part-time, contract, retired, etc. depending on your purpose.

.. To arrive at an efficient representation of the user domain, we group individual instances into populations so that we can define and manipulate its members in the same way. Efficiency goes for (1) our own human ability to grasp an understanding of our world, (2) our ability to define relationships, attributes, constraints, etc. which apply uniformly to members of a population, and (3) our processing the representation of that world in our databases.

.. Thanks, Fabian, for calling me out on that one. I am not sure I would use the term "pragmatic" since that implies the designer's choice is matter-of-fact, obvious, cut and dried, unequivocal, or settled. However, there may still be disagreement and alternative designs. Anyone suggest a better word?

Concept Modeling vs. Conceptual Data Modeling

George McGeachie posts (LinkedIn, 2020 April 22)
To Fabian Pascal: Broadly speaking, I agree with your three levels [conceptual, logical, physical], but please don't assume that my view of 'data modelling' is limited to logical database design. "Conceptual" modelling is better described as "Concept" modeling (see the works of and ), so people don't mentally tag the word 'data' on the end (Conceptual Data Modelling). Where do we stop modelling 'Concepts' and start modelling 'Data'? It's probably somewhere in the top level of representation, before Logical Database Design.

To which Everest responds:

Thanks, George. I like that. I never did like the phrase "data" modeling because, at the heart of it, we are not modeling data, we are modeling "things" in the user domain. Now I can say we are modeling "concepts" in the user domain. Modeling "things" suggests only nouns, whereas modeling "concepts" can include constraints, "business" rules, modifiers, roles, subsets, etc. Thus, concept modeling can be very rigorous, richer, and detailed (as in ORM). But we can recognize that this is a prelude to modeling "data" as it would ultimately be manifest in a "logical" data model. Perhaps that helps me better understand what Fabian keeps saying when distinguishing "conceptual" from "logical" data modeling... and his "logical" is (his version of) relational/RDM which is already a step toward implementation (I would argue) simply because it is clustering "attributes" into (1NF) relations.

All language is referential: Members of a population vs. identifiers

John O'Gorman posts (LinkedIn 2020 April 22)

Since all language is referential I can declare membership of strings based on their usage in communication. I don't need attributes or properties to do so. For example, the string 'John Smith' looks to me to refer to a Person, so I make an ontological commitment to associate that string with that (Person) class. Doing so accomplishes a couple of things: 1. I can use that class anywhere I might want to reference members of that collection.

Everest responds:

John O'Gorman. "All language is referential" - love it and I agree absolutely. Let's remember that we are modeling/representing things in some user domain. We (the designer/modeler) are the ones who define the groupings into populations. That process may seem arbitrary, but is chosen by the designer based on their purposes, and how they wish to view the world. Such groupings do not naturally occur. The world only consists of instances of things. The designer, of course, uses clues in deciding the rules of membership, and it is usually based on our observed characteristics of individual prospective members of a population and what is of interest to us.
.. The task of the modeler is to design a model to (accurately) represent the user domain. The model is essentially an abstract representation, abstract because we use "tokens" to represent things and populations of things in that domain. The dilemma for us is we must find some way to (uniquely) identify individuals (members of populations). You would not be comfortable being put into a data storage device and spinning around on the surface of a disk at 100 mph! So we need a token to serve as a surrogate for you. We call that an identifier. (Criteria for choosing an identifier is a topic for another discussion). Its form is usually some string of characters.
.. The trap we fall into is thinking that a character string, a particular surrogate token, IS the person.
It is not the string "John Smith" you want to associate with the class (or population), it is the actual person. You need to have confidence that the string uniquely identifies or references the person. The operation of our systems and databases depends on it. This is why we need to have a careful definition of the population of things we use in developing our models.
.. Defining populations and the criteria for inclusion and exclusion (and choosing identifiers) are the critical and difficult tasks of a "data" modeler.
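The surrogate-token point can be made concrete: the identifier references the person record, and a name string cannot serve as the identifier because two distinct people may share it. All data below is made up for illustration.

```python
# The identifier (here an employee number) is a surrogate token for the person.
people = {
    "E1001": {"name": "John Smith", "dept": "Sales"},
    "E1002": {"name": "John Smith", "dept": "Legal"},  # a different John Smith
}

# The string "John Smith" is not the person; it fails to identify uniquely.
same_name = people["E1001"]["name"] == people["E1002"]["name"]
same_person = people["E1001"] == people["E1002"]
```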

2020-04-21

In a data model: nouns and predicates

John O'Gorman asks (LinkedIn Data Modeling 2020 April)

Why do data models only include nouns? Second, the word 'Status' is a noun, right? If I use it as the name of a set, could I include the words 'Active', 'Inactive', 'Stalled', and 'Inverted' as members of the set? If so, could I include them in a data model as Concepts even though they are clearly not nouns?

Ken Evans answers:

Not true. A proper data model has nouns and predicates that define the relationships between the nouns.

Everest responds:
First, for data modeling, I note that a noun implies a population of "things". Perhaps the hardest but most important part of building a data model is in defining the members of that population so we can always determine what is included and what is excluded from the population.
Not only nouns and predicates (verb phrases) but also adjectives, understanding that an adjective serves to restrict the population of the noun. A noun qualified by an adjective would name a subset of the noun population, e.g., Employee, and full-time Employee.
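The adjective-as-subset idea can be shown in a line of code: the qualified noun names a restriction of the noun's population. The sample records are hypothetical.

```python
# The Employee population; the adjective "full-time" restricts it to a subset.
employees = [
    {"name": "Ann", "full_time": True},
    {"name": "Bo", "full_time": False},
]
full_time_employees = [e for e in employees if e["full_time"]]
```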


2020-04-19

Multifactor authentication, with examples.

Generally there are three types of "factors" to consider: something you HAVE, something you KNOW, and something you ARE (unique personal characteristics). It is possible to buy a safe in which you need to HAVE a key, and you need to KNOW the sequence of numbers to enter using a keypad. Today two-factor authentication using computers is generally KNOWing a password and HAVing a mobile phone. The third (something you ARE) is biometrics - fingerprint, iris scan, facial scan, genome, etc. Anyone know what might be considered a fourth factor? For a fuller description and several examples, see my textbook, Database Management, McGraw-Hill, 1986, section 14.4 (pages 515-529). As a mini-test (my being a teacher) how would you classify each of the following: handwriting, hand geometry, voice, key (on a key ring), ID card, personal history, combination lock, password, fingerprint. (LinkedIn Data Modeling 2020 April)
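One way to state the multifactor idea in code: authentication counts as multifactor only when the evidence spans at least two of the three categories (HAVE, KNOW, ARE). The policy function below is a hypothetical illustration, not taken from the textbook cited.

```python
FACTOR_CATEGORIES = {"HAVE", "KNOW", "ARE"}

def is_multifactor(categories_satisfied):
    """True when the evidence spans at least two distinct factor categories."""
    return len(set(categories_satisfied) & FACTOR_CATEGORIES) >= 2

# Two passwords are still one factor (KNOW); a password plus a phone is two.
```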

Jan Berdnik comments:
Better: ONLY you know and ONLY you have. As with any genuine authorising.

Everest responds:
Better maybe, but can't be guaranteed. That is precisely why multifactor authentication is so much better -- it can be orders of magnitude more secure, than trying so hard to come up with passwords difficult to guess or remember. Some organizations are going too far with the rules for constructing passwords making it a real burden for users.

2020-04-01

Thinking about Attributes or Properties

Kevin Feeney (LinkedIn, Data Modeling, 2020/03/24)
In his presentation on data modeling (https://lnkd.in/dnwYTEY), he says that an accurate data model defines Things, Properties of things, how things are Identified, and Relationships.

Everest says:
A caution when thinking about Properties.  You cannot define an Attribute until you first have (or presume) a Relationship.  An Attribute is a thing with a population (a domain of values).  So ORM does not distinguish, it calls them both Objects.  For example, I could have a thing called a Skill Code and Employees have Skills.  That means there is a relationship between Employee and Skill.  We often depict an Attribute being tucked away in a box for the Employee.  This naturally leads to (thinking about) putting it in a column in a table for the Employee entity.  That can lead to problems.  In ORM we defer thinking about tables since that is really a step toward implementation (in a Relational DBMS).  Better to think in terms of two objects, Employee and Skill, with a relationship between them.  So here is the definition:  An ATTRIBUTE (or Property) is an OBJECT which plays a ROLE in a RELATIONSHIP with another OBJECT.  Now we can add cardinality to the relationship.  In fact, in this example, if an Employee can possess multiple Skills there is a M:N relationship and Skill cannot be stored in an Employee table (it would violate First Normal Form).  But Skill is no less an attribute of Employee, even if it is not stored in the Employee table.  That further reinforces the fact that an OBJECT has ATTRIBUTES by virtue of having RELATIONSHIPS with other OBJECTS.  Hence, there is no need for an Attribute artifact in a data model.
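The Employee-Skill example can be sketched as it would land in a 1NF relational design: the M:N relationship lives in its own set of pairs, outside any Employee row, yet Skill remains an attribute of Employee through that relationship. The data is illustrative only.

```python
employees = {"E1", "E2"}
skills = {"Welding", "Python"}

# The M:N relationship is a separate relation of (employee, skill) pairs;
# a 1NF Employee row could not hold this multivalued attribute directly.
employee_skill = {("E1", "Welding"), ("E1", "Python"), ("E2", "Welding")}

def skills_of(emp):
    """Skill is still an attribute of Employee, reached via the relationship."""
    return {s for (e, s) in employee_skill if e == emp}
```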

2020-03-29

How a data model is expressed

Fabian Pascal posts (LinkedIn, Data Modeling, 2020/03/29)
Shown a data model diagram, he says "Actually, it is not a logical model, but a graphical representation of it that users understand -- the DBMS doesn't. Users understand the conceptual model SEMANTICALLY, the DBMS "understands" the true logical model ALGORITHMICALLY and that's not what your drawing is."

Everest responds:

Regarding model (re)presentation, the exact same model can be presented in a variety of ways - graphical diagram, formal linear (for machine processing; I take it you call this "logical"). But I would hope and expect that the underlying semantics would be exactly the same. We build a data model, and it should not matter how it is expressed, as long as the expressions are equivalent. The semantics relate to the model, not how it is expressed (Conceptual?). Semantics is the meaning behind what is presented. Furthermore, all models are logical, that is, built according to some set of rules of logic and formal argument, no matter how a model is presented or who reads it (check your dictionary). The rules in this case constitute what I call the modeling scheme.

Picking the right users for a data modeling project

Ken Evans said (LinkedIn, Data Modeling, 2020/03/29)
The piece of the puzzle that you have not mentioned is where the users' understanding of their domain is rather vague.

Everest responds:

If you have users with a vague understanding of their domain, you are talking to the wrong people. I have found that there are people in the user domain who really do know what is going on, what their world looks like. They are often seen as troublemakers, asking the tough questions, complaining about how things are done (or not done), and suggesting how things could be done better. People on the front lines, working in the trenches, who actually think about what they are doing, are not satisfied with the status quo, and go beyond their job description. Every organization has such people; you just need to find them. And the best way to find them is to ask other users. Most can readily tell you who they are. If there is no one they can point to, you have a dead or dying organization where nobody cares. The people who fit the profile above will generally not be management or senior level -- who are usually tending to managing and training people, as they should be. Once you get the right people to the table, it takes a skilled facilitator to elicit the needed information and document it in a usable form, i.e. a data model, such that they understand and concur with the representation.

Who judges the accuracy of a data model?

Kevin Feeney says (LinkedIn, Data Modeling, 2020/03/29)
How do we know that we have a correct model? My general take would be that we know the model is correct, if the world moves and the model moves with it without breaking anything - if the model is wrong, you'll find duplication and other errors seep in. In terms of how you set the scope for the model and how you deal with systems integration and different views on data - generally you need to go upwards in abstraction and make the model correct at the top level. For example, when you find that two departments have different concepts of what a customer is, both of which are correct from their points of view, there are implicit subclasses (special types of customer) that can be made explicit to make everything correct.

Everest responds:

Kevin, I see two main problems with your viewpoint. (1) It sounds like you are saying design it, build it, and wait for problems to arise. That would be irresponsible and dangerous; surely we need to judge correctness before we commit resources to implementation. Before implementing the model and building a database, we need to have some assurance that our model is an accurate representation of the user domain. (2) It sounds like you are depending on the designers/modelers to make judgments about model correctness. That is the last thing I would do. Too often I have found that the data modeling experts had only a superficial understanding of the user domain. They may be well versed in the modeling task, but that doesn't produce a good model. The best modeling tool in the world and the best modeling methodology would be insufficient to produce a "correct" data model. Rather than a "correct" model I prefer to call it striving for an "accurate" data model, that is, one that accurately represents the user domain. As Simsion argued and I agree, there is no single correct model. So, who best to judge?

So, who best to judge the "correctness" of a data model? I say, the USERS THEMSELVES. They are the ones who understand their world better than anyone else. But you have to get the right users to the table. I have led dozens of data modeling projects and we only go to implementation when ALL the user representatives sign off and say "Yes, this is an accurate representation of our world." If there are differences, they must be resolved among themselves (with wise direction from a trained data modeler). One caveat: the users must thoroughly understand the data model, in all its glorious detail (not high-level). It is the responsibility of the data modeler to ensure the users collectively understand all the details of the model -- an awesome responsibility. That means the users must understand the model diagrams and all the supporting documentation, particularly the definition of the "things" (entities, objects), relationships (binary, ternary, and more), and all the associated constraints (e.g., cardinalities). Our goal is to develop as rich a representation as possible of the semantics of the user domain, and that means having a rich set of constructs to use in developing the model. So far, I see ORM as the richest modeling scheme.

The best way to make this happen is for the user representatives to be part of the modeling team. In fact, they should be the ones in control. Upper management needs to grant release time to those users most knowledgeable about their domain. An experienced data modeler needs to facilitate and guide the modeling process and the development of the data model. The team needs to be allowed to meet and deliberate as long as necessary to arrive at a model which they all feel comfortable approving. In my experience the users have always known when they were done (and ready to go to implementation), although the time it took was difficult to predict up front. Only in one project were we unable to come to agreement and that is because we had the wrong user representatives at the table. They were little more than data entry clerks who really didn't understand the meaning of the data, why it was important, nor how it was used.

2020-03-26

Determining the "Correctness" of a data model

LinkedIn, Data modeling, 2020/3/26
I asked: How do we know when a data model is correct?
Ken Evans responded: That's easy, when the model conforms to stated requirements.
I then asked "who determines/documents the requirements?"
Ken responded:
It does not matter "who" determines the requirements. The point is that you can only judge whether a deliverable is "correct" if you have a set of pre-established requirements against which to assess "correctness". This principle has been widely accepted in the quality management discipline since 1978 when Philip Crosby published his book "Quality is Free." Crosby makes the point that "Quality is conformance to requirements."

Everest Responds:
Sorry about the "who", perhaps I should have said "how." While I accept the general principle, is it realistic? Crosby's statement begs the question, if quality is conformance to requirements, then who/how do we determine the correctness and completeness of the stated requirements? I have yet to see anything close to an a priori statement of requirements that was sufficient to judge the correctness of the end result, i.e., a domain data model. Furthermore, I have yet to see any guidelines sufficient for preparing a statement of requirements for a data modeling project. I would love to see any examples.
.. To me, the only satisfactory "statement of requirements" sufficient to judge the correctness of the model would be the final, detailed data model itself. Anything less than that would not be sufficient to express the full set of semantics to be included in the final model. In the case of ORM perhaps, a complete set of elementary fact sentences, with well defined object types and predicates. But that is what the entire process of data modeling is all about -- to discover and document the semantics of some user domain. We want to capture as rich a set of semantics as possible given our modeling scheme (which is why we need to use a modeling scheme such as ORM, which captures many more semantics than any other scheme, including ER/relational).
.. So the question remains, whether we are talking about the data model, or the requirements for a data model -- how do we judge the correctness of a data model? Who is in the best position to do that?