Everest on Data Modeling: September 2020

Intro

I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students, some are my responses to posts on the Linkedin forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to: geverest@umn.edu since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to do that.

2020-09-24

Graph Databases and Graph Theory - What is a Graph?

Victor Morgante, in "What is a graph database?" 2020 Sept 15

https://towardsdatascience.com/what-is-a-graph-database-249cd7fdf24d

is trying to get beyond the sales hype for graph databases by getting down to a base definition:

“A graph database is any database over which a graph schema ... can be written”

However, there are many different definitions and conditions for a graph schema which can account for the differences in graph databases.

When we speak of graph databases we are invoking the notion of a graph. At its base, a graph is simply a collection of nodes, representing things, and lines or arcs, each representing a relationship between nodes. If we add that an arc is always between two nodes, then we are restricted to binary relationships.

A critical assumption about graphs in graph theory, and one that is often overlooked or not stated, is that the nodes are homogeneous, at least in some sense, that is, they all belong to one population. The other assumption is that each node is uniquely identifiable and distinguished from all other nodes, that is, there is a name space over the population of nodes.

You have made it clear that the nodes in a graph you are speaking of are at the schema level, that is, every node represents a type or population of instances of things (or domain of values or identifiers)

Beyond that, there are many rules or characteristics we may assume or impose on the formation of a graph:

- Connected – every node must be connected to at least one other node?

- Acyclic – there is at most one path between any two nodes.

- Rooted – there is a single "entry point" into the graph (These three serve to define a tree structure)

- Directed – every arc "points" in one direction (and never both directions).

- Binary only, or allowing a reflexive relationship, a node relating or pointing to itself

- Ternary and higher order relationships involving more than two nodes. E.g, Person possesses Skill at level of Proficiency, or Person enrolled in Class at School earns Grade.

- Naming the arcs to reflect some semantics about the relationship. In the data world, these would be verb phrases or predicates (while the node names are nouns). Note, there is always an inverse reading, whether expressed or not. E.g, Person works in Dept, Dept employs Person.

- Role Names – an alternate name for the node which reflects the role it plays in the relationship. E.g., Person working in a Dept could be called an Employee, Person enrolled in a class is a Student.

- Being at the type or schema level, with each node being a population of instances, it now becomes critically important to specify for each relationship (arc) its cardinality, that is, the mandatory/optional and exclusivity/multiplicity characteristics, and in both directions (in each binary relationship).

In any given context we may go about drawing nodes and arcs without stating our assumptions about the nature or characteristics of the graph. Perhaps it is differences in how these characteristics are manifest in particular graph databases that account for the different schemes. To be sure the differences are significant in our understanding of graph databases in different data modeling schemes and data management systems. Now it makes a difference how you define your graph database. It is overly simplistic to say that every relational database is a graph database.

2020-09-19

Definition of Arc in Graph Theory

I have always used the term "arc" to refer to a line drawn between two nodes to represent a relationship between the nodes in a diagram. I never meant it to be directional. I apologize if people were confused. In graph theory, an arc is defined as "an ordered pair." That means it has direction -- the arc (A-B) is different from the arc (B-A), even if we do not include an arrowhead or some other notation. How confusing! There is some inconsistency in graph theory, since they use the term "directed" graph or digraph when the arcs all have a direction. Then they always draw the arcs as an arrow... go figure.

I suppose I could be using "line" or "edge" which would be more correct in graph theory.

Well, I am not about to go back an revise everything I have written or in data model diagrams I have ever drawn! So be advised. When I say "arc" or draw one between two nodes, it never implies directionality. If direction is important I have always included some additional notation to indicate the direction, as for cardinality, a physical pointer, or a predicate reading.

2020-09-13

Concept Modeling

Thomas Frisendal posted a nice piece on concept(ual) modeling: (2020 April 13)
www.dataversity.net/the-conceptual-model-strikes-back/#
In it he said that there were common ingredients to all conceptual modeling schemes:

Concepts
Properties
Relationships
The Triple (subject – predicate – object)
Occasional elements of semantic sugar such as cardinalities, class systems and various other abstractions.

I would like to suggest a few clarifications or modifications to this list as it might pertain to business "data" modeling, recognizing that we are modeling some aspect of a business/user domain (which may be represented in a stored database).

1. Note that "Concept" is a noun and each named concept represents a population of instances of some class of things. It is not just an abstract notion of some idea. Conceptual is an adjective. The key here for modeling is that Concepts need to be clearly defined so we know what instances are included and which are excluded from the population.

2. The notion of Property is derived (often called an "attribute" in data modeling). Here is my definition: An Attribute is a Concept (or Object) playing a Role in a Relationship with some (other) Concept. As an example, we can have the Concept of a Date, define the population of dates, and perhaps the format of its lexical representation. Now with the Concept of Employee, we could have the Concept of Birthdate (or HireDate or...). That would necessitate a Relationship between the two Concepts. In that Relationship, Birthdate is a Role name for Date. We could also have a predicate phrase to name the relationship, such as "is born on" (we could also have an inverse reading). The Relationship between Concepts comes first, before we can speak of Properties. Properties can exist all by themselves without having a relationship with a Concept, precisely because it is first, a Concept. In fact modeling circles, some people speak of an "attributeless" modeling scheme.

3. Recognize that Relationships can be more than binary. Nijssen's first paper (1976) was on Binary Modeling. It later became Fact modeling, recognizing that Facts could be unary, ternary, or more.

4. Saying that a common ingredient is the Triple, restricts us to binary relationships. In Halpin's ORM a predicate can be of any arbitrary arity. Higher order relationships are represented by objectifying a predicate. This is equivalent to having a separate table in a relational data model to represent even a many-to-many binary relationship. This stems from the first normal form restriction which says that an "attribute" can be at most single valued. That means it is possible to directly represent at most a 1:Many relationship. Since this restriction is already a step toward implementation in a relational database, I argue that it has no place in a business concept "data" model. People don't have difficulty comprehending a M:N or even a ternary relationship, even if our relational data management systems do!

5. For the occasional elements I call them all constraints, or more generally business rules. This includes mandatory/optional and exclusive/multiple (what you call cardinality), and so much more, particularly conditional constraints (which is not even possible in even the more advanced DBMSs). In fact, our overarching goal in concept modeling is to capture and formally express as rich a set of semantics as possible about the user domain. Lacking expressible semantics in our models we are doomed to data quality issues. One of the best examples of rich data semantics capture is in Halpin's ORM flavor of fact modeling.

2020-09-08

Modeling a Many-to-Many reflexive relationship

Response to John Sullivan post on LinkedIn, 2020 Sept 3.

His paper: https://www.linkedin.com/pulse/better-database-design-part-1-john-sullivan/

shows an entity labeled Organization Structure having two defined relationships with an Organization entity. The relationship arcs are labeled Parent Organization and Child Organization. In the descriptive text he states that the Organization Structure entity has two fields: Parent Org ID, and Child Org ID. Each field in a foreign key to its respective Organization entry, and together they form a composite key for the Organization Structure entity.

Parent Organization

Organization

Structure

>O-----------------------|-

Child Organization

Everest responds:

John, I hate to say it but you have fallen into the TABLE THINK trap, as has everyone else who tries to model Organizational Structure this way. You tell me how many business users will understand what you have called the "Organization Structure" entity in the diagram. You are right to say they should be greyed out. In fact, they should not be there at all. They are there solely for the purpose of implementation (in a relational DBMS). How many times do we say, "the logical model should be independent, i.e., show nothing related to implementation or physical storage? The reality is that this representation is forced when you have to implement in 1NF relations (tables) – a relational model can only directly represent at most a 1:Many binary relationship. That is because, in a relational database, it is not possible to directly represent a many-to-many relationship – all attributes must be at most single valued. The proper logical diagrammatic representation would be an arc representing a relationship on the Organization entity to itself with a fork (for manyness) at each end of the arc. The labels on the arc are role names for each Organization as they participate in the relationship.

Organization

Parent

>-----------

v |

|___Child_______|

Most people will have no difficulty understanding the notion of a many-to-many relationship, once they understand the meaning of the fork (already used in IE notation). Then we need to label the ends of the arc with the ROLE each instance plays in the relationship, either parent or child. We also need to apply some constraint declarations to this relationship. For example, we probably don't want to allow an organization to be both parent and child in the same relationship instance. We may also want to exclude a circular chain of parent-child relationships. As an aside, let me say that all of this is handled quite nicely and precisely in Halpin's Object Role Modeling (ORM) scheme, a variant of fact modeling, and the focus of my years of teaching advanced data modeling at the University of Minnesota.

Everest on Data Modeling