Intro

I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students; some are my responses to posts on the LinkedIn forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to geverest@umn.edu, since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to do that.

2020-04-25

Are we really modeling data?

Even the title of this blog is a little misleading.  Perhaps I should call it
Everest on "Data" Modeling!
I never did like the phrase "data" modeling.  It suggests that it is a "model of data."  That is misleading to someone outside of our community.  At the heart of it we are not modeling data, we are modeling some user domain. It is only a model of data if we have some data.  Then the model would be an (abstract) "representation" of the data.  For us, as data modelers, a "data" model is a model of some aspects of a domain of interest to a community of users, real (world) or imagined/desired/yet to be built.  It is a model expressed "in data," that is, built using informational constructs, all guided by a modeling scheme.  The modeling scheme tells us what to look for in the domain and how to represent it in the model.  So we identify a population of similar things, give it a label and a definition, and put a box or circle into a diagram to represent that population of things.  We build up or "design" a model with lots of types of things, add relationships among those things, and constraints on those things and relationships.  The modeling scheme tells us how to represent those relationships and constraints in our model.

2020-04-23

Is defining entity/object type populations arbitrary?

I have been saying (see my video lecture on Subtypes/Supertypes on YouTube) that defining populations is essentially "arbitrary", a choice made by the designer in modeling some user domain.  This is to contrast with a common view that there is one correct model and it is the good designer who can find it!

Fabian Pascal commented:  I would not refer to it as "arbitrary", but pragmatic -- to incorporate perceptions and serve application needs of users.

I dislike the word "arbitrary" too. Fabian is absolutely correct. Sometimes it may seem arbitrary to others, but it is actually a choice made by the designer(s) based on the purpose(s) of the model. Since the world presents itself to us as only made up of instances (individual things), the modeler must choose how to define the grouping of things into populations or classes. This is part of the abstraction process in modeling. As an example, one may be interested in animals and a cow would be an instance. But a dairy farmer probably wants to keep information on each individual cow. As another example, a designer may define a population of Employee, but that may include or exclude applicants, furloughed, part-time, contract, retired, etc. depending on your purpose.

.. To arrive at an efficient representation of the user domain, we group individual instances into populations so that we can define and manipulate its members in the same way. Efficiency goes for (1) our own human ability to grasp an understanding of our world, (2) our ability to define relationships, attributes, constraints, etc. which apply uniformly to members of a population, and (3) our processing the representation of that world in our databases.

.. Thanks, Fabian, for calling me out on that one. I am not sure I would use the term "pragmatic" since that implies the designer's choice is matter-of-fact, obvious, cut and dried, unequivocal, or settled. However, there may still be disagreement and alternative designs. Anyone suggest a better word?

Concept Modeling vs. Conceptual Data Modeling

George McGeachie posts (LinkedIn, 2020 April 22)
To Fabian Pascal: Broadly speaking, I agree with your three levels [conceptual, logical, physical], but please don't assume that my view of 'data modelling' is limited to logical database design. "Conceptual" modelling is better described as "Concept" modeling (see the works of and ), so people don't mentally tag the word 'data' on the end (Conceptual Data Modelling). Where do we stop modelling 'Concepts' and start modelling 'Data'? It's probably somewhere in the top level of representation, before Logical Database Design.

To which Everest responds:

Thanks, George. I like that. I never did like the phrase "data" modeling because, at the heart of it, we are not modeling data, we are modeling "things" in the user domain. Now I can say we are modeling "concepts" in the user domain. Modeling "things" suggests only nouns, whereas modeling "concepts" can include constraints, "business" rules, modifiers, roles, subsets, etc. Thus, concept modeling can be very rigorous, richer, and detailed (as in ORM). But we can recognize that this is a prelude to modeling "data" as it would ultimately be manifest in a "logical" data model. Perhaps that helps me better understand what Fabian keeps saying when distinguishing "conceptual" from "logical" data modeling... and his "logical" is (his version of) relational/RDM which is already a step toward implementation (I would argue) simply because it is clustering "attributes" into (1NF) relations.

All language is referential: Members of a population vs. identifiers

John O'Gorman posts (LinkedIn, 2020 April 22)

Since all language is referential I can declare membership of strings based on their usage in communication. I don't need attributes or properties to do so. For example, the string 'John Smith' looks to me to refer to a Person, so I make an ontological commitment to associate that string with that (Person) class. Doing so accomplishes a couple of things: 1. I can use that class anywhere I might want to reference members of that collection.

Everest responds:

John O'Gorman. "All language is referential" - love it and I agree absolutely. Let's remember that we are modeling/representing things in some user domain. We (the designer/modeler) are the ones who define the groupings into populations. That process may seem arbitrary, but is chosen by the designer based on their purposes, and how they wish to view the world. Such groupings do not naturally occur. The world only consists of instances of things. The designer, of course, uses clues in deciding the rules of membership, and it is usually based on our observed characteristics of individual prospective members of a population and what is of interest to us.
.. The task of the modeler is to design a model to (accurately) represent the user domain. The model is essentially an abstract representation, abstract because we use "tokens" to represent things and populations of things in that domain. The dilemma for us is that we must find some way to (uniquely) identify individuals (members of populations). You would not be comfortable being put into a data storage device and spinning around on the surface of a disk at 100 mph! So we need a token to serve as a surrogate for you. We call that an identifier. (Criteria for choosing an identifier are a topic for another discussion.) Its form is usually some string of characters.
.. The trap we fall into is thinking that a character string, a particular surrogate token, IS the person.
It is not the string "John Smith" you want to associate with the class (or population), it is the actual person. You need to have confidence that the string uniquely identifies or references the person. The operation of our systems and databases depends on it. This is why we need to have a careful definition of the population of things we use in developing our models.
.. Defining populations and the criteria for inclusion and exclusion (and choosing identifiers) are the critical and difficult tasks of a "data" modeler.
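
The distinction between the token and the thing it stands for can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Person` and `Population`, the identifier strings "P-001" and "P-002", and the definition text are all hypothetical, and the `Person` object is itself just a stand-in for a real individual.

```python
from dataclasses import dataclass, field


class Person:
    """Stand-in for a real-world individual; two people can share a name."""

    def __init__(self, name: str) -> None:
        self.name = name


@dataclass
class Population:
    """A designer-defined grouping whose members are reached via identifiers."""

    label: str
    definition: str  # the stated criteria for inclusion and exclusion
    _members: dict = field(default_factory=dict)

    def enroll(self, identifier: str, individual: object) -> None:
        # An identifier must reference exactly one member of the population.
        if identifier in self._members:
            raise ValueError(f"identifier {identifier!r} is already in use")
        self._members[identifier] = individual

    def resolve(self, identifier: str) -> object:
        # The string is only a token; the lookup returns what it stands for.
        return self._members[identifier]


# Hypothetical population and members, for illustration only.
people = Population("Person", "individuals known to the organization")
first = Person("John Smith")
second = Person("John Smith")
people.enroll("P-001", first)
people.enroll("P-002", second)
```

Here the string "John Smith" cannot serve as the identifier because it does not pick out one individual, whereas each enrolled identifier resolves to exactly one member.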

2020-04-21

In a data model: nouns and predicates

John O'Gorman asks (LinkedIn Data Modeling 2020 April)

Why do data models only include nouns? Second, the word 'Status' is a noun, right? If I use it as the name of a set, could I include the words 'Active', 'Inactive', 'Stalled', and 'Inverted' as members of the set? If so, could I include them in a data model as Concepts even though they are clearly not nouns?

Ken Evans answers:

Not true. A proper data model has nouns and predicates that define the relationships between the nouns.

Everest responds:
First, for data modeling, I note that a noun implies a population of "things". Perhaps the hardest but most important part of building a data model is in defining the members of that population so we can always determine what is included and what is excluded from the population.
A data model includes not only nouns and predicates (verb phrases) but also adjectives, with the understanding that an adjective serves to restrict the population of the noun. A noun qualified by an adjective names a subset of the noun population, e.g., Employee and full-time Employee.
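
One way to picture an adjective restricting a noun population is as a membership rule applied to a set. The sketch below is an assumption-laden illustration: the `Employee` fields, the sample members, and the 35-hour threshold for "full-time" are all hypothetical choices a designer would have to make and document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    emp_id: str          # hypothetical identifier
    hours_per_week: int  # hypothetical characteristic used for membership


def is_full_time(employee: Employee) -> bool:
    # Assumed membership rule; the real criterion is the designer's choice.
    return employee.hours_per_week >= 35


# The noun names a population ...
employees = {Employee("E1", 40), Employee("E2", 20), Employee("E3", 38)}

# ... and the adjective restricts it to a subset of that population.
full_time_employees = {e for e in employees if is_full_time(e)}
```

Every member of `full_time_employees` is also a member of `employees`, which is the subset relationship the adjective expresses.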


2020-04-19

Multifactor authentication, with examples.

Generally there are three types of "factors" to consider: Something you HAVE, something you KNOW, and something you ARE (unique personal characteristics). It is possible to buy a safe in which you need to HAVE a key, and you need to KNOW the sequence of numbers to enter using a keypad. Today two-factor authentication using computers is generally KNOWing a password and HAVing a mobile phone. The third (something you ARE) is biometrics - fingerprint, iris scan, facial scan, genome, etc. Anyone know what might be considered a fourth factor? For a fuller description and several examples, see my textbook, Database Management, McGraw-Hill, 1986, section 14.4 (pages 515-529). As a mini-test (my being a teacher) how would you classify each of the following: handwriting, hand geometry, voice, key (on a key ring), ID card, personal history, combination lock, password, fingerprint. (LinkedIn Data Modeling 2020 April)

Jan Berdnik comments:
Better: ONLY you know and ONLY you have. As with any genuine authorising.

Everest responds:
Better maybe, but can't be guaranteed. That is precisely why multifactor authentication is so much better -- it can be orders of magnitude more secure than trying so hard to come up with passwords difficult to guess or remember. Some organizations are going too far with the rules for constructing passwords, making it a real burden for users.
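
The three factor types can be sketched as a small classification plus a check that the credentials presented span more than one type. This is a minimal illustration, not a security implementation: the names `Factor`, `CREDENTIAL_FACTORS`, and `is_multifactor` are hypothetical, and the table only classifies the examples already given above (safe key, keypad sequence, password, mobile phone, fingerprint, iris scan).

```python
from enum import Enum


class Factor(Enum):
    HAVE = "something you have"
    KNOW = "something you know"
    ARE = "something you are"


# Classifications taken from the examples in the text above.
CREDENTIAL_FACTORS = {
    "key": Factor.HAVE,
    "mobile phone": Factor.HAVE,
    "keypad code": Factor.KNOW,
    "password": Factor.KNOW,
    "fingerprint": Factor.ARE,
    "iris scan": Factor.ARE,
}


def is_multifactor(presented: list[str]) -> bool:
    """True when the presented credentials span at least two factor types."""
    kinds = {CREDENTIAL_FACTORS[name] for name in presented}
    return len(kinds) >= 2
```

Under this sketch a password plus a mobile phone counts as multifactor, while a password plus a keypad code does not, since both are something you KNOW.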

2020-04-01

Thinking about Attributes or Properties

Kevin Feeney (LinkedIn, Data Modeling, 2020/03/24)
In his presentation on data modeling (https://lnkd.in/dnwYTEY), he says that an accurate data model defines Things, Properties of things, how things are Identified, and Relationships.

Everest says:
A caution when thinking about Properties.  You cannot define an Attribute until you first have (or presume) a Relationship.  An Attribute is a thing with a population (a domain of values).  So ORM does not distinguish between the two; it calls them both Objects.  For example, I could have a thing called a Skill Code and Employees have Skills.  That means there is a relationship between Employee and Skill.  We often depict an Attribute being tucked away in a box for the Employee.  This naturally leads to (thinking about) putting it in a column in a table for the Employee entity.  That can lead to problems.  In ORM we defer thinking about tables since that is really a step toward implementation (in a Relational DBMS).  Better to think in terms of two objects, Employee and Skill, with a relationship between them.  So here is the definition:  An ATTRIBUTE (or Property) is an OBJECT which plays a ROLE in a RELATIONSHIP with another OBJECT.  Now we can add cardinality to the relationship.  In fact, in this example, if an Employee can possess multiple Skills there is an M:N relationship and Skill cannot be stored in an Employee table (it would violate First Normal Form).  But Skill is no less an attribute of Employee, even if it is not stored in the Employee table.  That further reinforces the fact that an OBJECT has ATTRIBUTES by virtue of having RELATIONSHIPS with other OBJECTS.  Hence, there is no need for an Attribute artifact in a data model.
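
The Employee and Skill example can be sketched with two object types and a relationship between them, with no separate Attribute construct. This is a rough Python illustration, not ORM notation: the class names `ObjectType` and `Relationship`, the role name "possesses", and the sample identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectType:
    name: str


@dataclass
class Relationship:
    """A binary relationship between two object types, stored as pairs."""

    subject: ObjectType
    role: str
    other: ObjectType
    facts: set = field(default_factory=set)  # (subject_id, other_id) pairs

    def add(self, subject_id: str, other_id: str) -> None:
        self.facts.add((subject_id, other_id))

    def values_for(self, subject_id: str) -> set:
        # The "attribute values" of a subject are the objects it relates to.
        return {o for s, o in self.facts if s == subject_id}

    def subjects_for(self, other_id: str) -> set:
        # The same facts read from the other side of the relationship.
        return {s for s, o in self.facts if o == other_id}


employee = ObjectType("Employee")
skill = ObjectType("Skill")
possesses = Relationship(employee, "possesses", skill)

# Sample facts: an employee may have many skills, a skill many employees.
possesses.add("E1", "SQL")
possesses.add("E1", "Python")
possesses.add("E2", "SQL")
```

In this sketch Skill is an attribute of Employee only by virtue of the `possesses` relationship, and the many-to-many facts sit in the relationship rather than in either object type.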