Intro

I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students; some are my responses to posts on the LinkedIn forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to geverest@umn.edu, since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to do that.

2018-01-02

Definitions of concepts should start with their supertype. Why not?

Andries van Renssen posts on LinkedIn:
A good definition of a concept should build on the definition of more generalized concepts. This can be done by two parts of a definition:
1. Specifying that the concept is a subtype of some other more generalized concept.
2. Specifying in what respect the concept deviates from other subtypes of the same supertype concept.

Such definitions have great advantages, such as:
‑ The definition of (and the knowledge about) the supertype concept does not need to be repeated, as it is also applicable ("by inheritance") for the defined (subtype) concept.
‑ The definition implies a taxonomy structure that helps to find related concepts that are also subtypes of the same supertype (the 'sister‑concepts'). Comparison with the 'sister‑concepts' greatly helps to refine the definitions. Furthermore, the definitions prepare for creating an explicit taxonomy.
‑ Finding the criteria that distinguish the concept from its 'sister‑concepts' prepares for explicit modeling of the definitions, which prepares for computerized interpretation and the growth towards a common Formal Language.
--Thus, why not?

18 comments:

  1. Axel T. posts:
    Your approach is particularly helpful when defining master data entities in the domain of Party, which usually has many roles (non-exclusive "or") such as Customer, Supplier, Employee, etc. Since these roles usually come with additional specific attributes, they will constitute real subtypes.

  2. Martijn E. posts:
    ..big gap between ontologically oriented models and practical data models is the fact that in practice identification is always on a kind of role, and never on the object itself. The basic objects have practically no reliable and cost-effective way of identification. This is why I am hesitant to create a 'person' table, but will readily create an 'employee' table.

  3. EVEREST RESPONDS:
    Wow, this conversation got off track fast, from ontologies to identifiers! Andries was first speaking of ontologies. In developing an ontology, we are defining concepts and relationships among concepts. In data modeling we are defining populations of objects. These are not the same. I assume that Andries was referring to concepts in an ontology. However, Axel T. immediately started speaking of data entities. Martijn then jumped to speaking of data models, which are always about populations of objects, not the definition of concepts. In ontologies, I don't see any obvious connection to populations of things in the user world. Andries, were you intending to talk about data modeling?
    ... In data modeling, to distinguish things from roles of things is precisely why we need subtypes and supertypes. That is the only way we can model overlapping populations. An underlying assumption in data modeling (ER, Relational, etc.) is that all type populations are mutually exclusive. When they are not, we need some other modeling construct, and that is what S/Stypes are all about. The distinction between the role Employee and the more general type Person is why we need subtypes and supertypes in data modeling. When the same person can be both an Employee and a Shareholder and a Customer, there are some attributes (data) which are common. Without S/Stypes, the three separate object populations lead to redundancy and, therefore, the possibility of inconsistency. The solution is to formally define a supertype Person to "contain" the common attributes, and subtypes for each of the roles and the additional attributes for the role.
    ... There is a subtype/supertype hierarchy (or more precisely a partial lattice, which allows for multiple supertypes on a subtype). As we move up the hierarchy to more general or inclusive populations, we lose business vocabulary: Employee is more understandable than Person, perhaps. At the very top of the hierarchy is a single object population which includes ALL objects (instances) of all types. It is the classic "Thing", whether concrete, abstract, or imagined. In the relational world this is called the "Universal Relation", a useful theoretical concept but not too practical.
    ... Now to identification and identifiers. I don't see how this has any connection to ontologies, where we need a vocabulary to talk about concepts. In data modeling, we must first specify the population of objects before we worry about identification of individual members, i.e., object instances. Identification means finding a reliable (usually lexical) surrogate (identifier, key) we can use to identify instances in the object population. Any discussion of subtypes/supertypes should (initially) be free of any discussion of identification. I know who David Hay is within my context, but that is merely a string of characters which serves as a surrogate for the particular human body. I could use a photograph. Within his family, David alone is probably a sufficient identifier. In a broader context, I may need more information or an artificially assigned token to uniquely identify the David I know, perhaps his DNA (however that is designated).
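
A minimal sketch of the overlapping role populations described in this comment, using Python sets. The member names and the sample data are hypothetical and are not part of the original discussion; the point is only that each subtype population is a subset of the supertype population and that the subtypes may overlap.

```python
# Hypothetical sample populations (illustrative names only).
person = {"ann", "bob", "cho", "dee"}     # supertype population
employee = {"ann", "bob"}                 # subtype (role) populations
shareholder = {"bob", "cho"}
customer = {"ann", "bob", "cho"}

roles = {"Employee": employee, "Shareholder": shareholder, "Customer": customer}

# Every subtype population must be a subset of its supertype population.
all_subsets = all(population <= person for population in roles.values())

# The subtype populations overlap: one person can play several roles.
plays_all_three = employee & shareholder & customer

# Not every member of the supertype has to be in some subtype.
in_no_role = person - (employee | shareholder | customer)

def roles_of(member):
    """Return the sorted names of the roles a given person plays."""
    return sorted(name for name, population in roles.items() if member in population)
```

Here `roles_of("bob")` lists all three roles, while `roles_of("dee")` is empty because that person is in the supertype population only.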

  4. Andries van Renssen responded with comments on: data modeling vs. ontology; roles; and identification vs. recognition.

    EVEREST responds:
    (Interesting. I got my start in IT working for Shell E&P in Calgary in the early 1960s).
    Andries, help me understand. I too once thought that ontology and data model were synonymous. In an ontology are you attempting to define terms and how terms are related to one another, or are you trying to describe some aspect of the "world" which is of interest to some community of users? Is there any notion of populations of things in an ontology? In both cases you are working with a vocabulary (terms, or names of populations of things). I agree that when we try to merge the vocabularies from two (different but likely overlapping) domains or contexts, it is exceedingly difficult to find/define a mapping between the two vocabularies.
    .. In data modeling, we have names for types of things. The grouping of things into populations is something the observer/designer imposes on their view of the world. The only real reality is the collection of individual things, initially unnamed, untyped, and unidentified, yet nonetheless real in their existence. The designer groups individual things into populations, assigns a type name to the group, and then figures out how to uniquely identify individual members of each group. Additionally, the designer defines the relationships between/among the populations of things, and the constraints on those things and relationships (business rules).
    .. I like your notion that a data model is/can be expressed as an "artificial language" in a particular context or domain. This is precisely the basis for Object Role Modeling (ORM): the types of things are the nouns, relationships are the verbs, and the constraints capture the characteristics of those things, relationships, compositions, etc. These constitute the vocabulary for the language. To this we add the grammar (rules of composition for statements in the language).

  5. EVEREST ON ROLES: I agree that a role is something one type of thing plays in a relationship with another (perhaps the same) type of thing. It was misleading for me to suggest that roles are subtypes. What I meant to say was that subtyping is one way to represent the notion of roles. Even left tires can be modeled as a subtype of tire. All subtyping is based upon one or more common characteristics of members of the supertype population (and if I don't have a characteristic, I can make one up: the extensional set as opposed to an intensional set). If one of those characteristics is "side of car" then that can be the basis for calling out a subtype of tire. If one of the characteristics of person is role(s) played, then I can model the roles of Employee, Student, and Customer as subtypes of Person. If it is a multivalued characteristic (i.e., a person can play multiple roles) then the subtypes will be overlapping populations.
    .. Let me further say that attribute (property, characteristic) is not a primary modeling construct. The primary modeling constructs are only things (in populations), and relationships. An object has an attribute by virtue of playing a role in a relationship with another object.

  6. EVEREST ON IDENTIFICATION: I was not confusing recognition with (unique) identification. There is no such thing as a "unique identifier" per se, only other things. Things in populations have no intrinsic unique identifier by themselves; they only exist in the population. That is what I meant when I said the human body we know as David Hay. Now, the only way we can talk about the things or concepts in our world domain of discourse is to attach labels or names or identifiers to them, to serve as surrogates for those things. Only when we have a vocabulary can we have a language. There are many possible object populations we can use to reference or identify members of some target population. All of them are either in popular usage, designed by some modeler, assigned by the system, or anything else. So I could have David (first name), David Hay (family name), social security number, system assigned surrogate key, DNA key, some combination of other related object types (attributes), or anything else. Choosing a good identifier is a subject in itself. In another forum I spelled out some 9 criteria for selecting a unique identifier. The important thing is that we want to pick a population of names or identifiers which has a 1:1 mutually dependent relationship with the members of the target population, guaranteed now and into the future.
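
The 1:1 requirement on an identifier population can be illustrated with a small check. This is only a sketch under stated assumptions: the function name `is_unique_identifier`, the member keys, and the sample labels are all hypothetical, and a check against current data says nothing about the "guaranteed now and into the future" part, which is a design commitment rather than something data can prove.

```python
def is_unique_identifier(assignment, population):
    """Return True if `assignment` (member -> label) gives every member of
    `population` a label and no two members share the same label."""
    covers_all = set(assignment) == set(population)
    labels = list(assignment.values())
    no_shared_labels = len(labels) == len(set(labels))
    return covers_all and no_shared_labels

# Hypothetical target population and two candidate identifier populations.
people = {"p1", "p2", "p3"}
first_name = {"p1": "David", "p2": "David", "p3": "Gordon"}
surrogate_key = {"p1": 101, "p2": 102, "p3": 103}

first_name_ok = is_unique_identifier(first_name, people)        # two Davids collide
surrogate_key_ok = is_unique_identifier(surrogate_key, people)  # one key per person
```

Within a smaller context (say, one family) the first name alone might pass the same check, which matches the remark that the adequacy of an identifier depends on the population being considered.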

  7. Andries van Renssen · About multiple supertypes: Concepts can be specialized according to more than one 'discriminator'. For example, pumps are conventionally specialized according to their orientation (horizontal or vertical axis) and also according to their operating principle (e.g. centrifugal or reciprocating). Thus we have four subtypes: horizontal pump, vertical pump, centrifugal pump and reciprocating pump.
    However, we have also further subtypes, such as horizontal centrifugal pump, vertical centrifugal pump, horizontal reciprocating pump and vertical reciprocating pump.
    The latter all have two supertypes. Both are fundamental according to your definition.
    I don't think that there is a method to avoid those two supertypes, except when you state that one of the above kinds of pumps or relations is forbidden. But from an engineering perspective that is not acceptable. Those kinds of things simply exist.

    It is possible to avoid the use of the last four subtypes, because you can classify individual things twice: P1 is classified as a horizontal pump AND P1 is classified as a centrifugal pump.
    However, there is no reason to state that it is forbidden to say that P1 is classified as a horizontal centrifugal pump.

    I know that multiple supertypes and multiple classifications are difficult to implement in some technologies, but that is a strange constraint, because semantically there is nothing wrong with it.

  8. David Hay: Actually, your "pump" example is a good example of the argument for the "Characteristic" model. I would define a "pump" as "a Piece of Equipment that is used to move fluids from one place to another". The supertype is "Piece of Equipment". That it may be "horizontal", "circulating", etc. is a categorization. (This is a special case of my use of "Characteristic".) A particular instance of "Pump" may have a "Categorization" into "Vertical", and another "Categorization" into "Centrifugal". If it has both, then it is, by definition, a "Vertical Centrifugal Pump".

    Your discussion makes it very clear that the subtype/supertype structure is a very impractical way to present this kind of configuration.

    Inherently, a Pump is a piece of equipment as is a Compressor, a Power Saw, etc. Having said that, you can categorize it many different ways.
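
One possible reading of the "Characteristic"/"Categorization" approach in this comment, sketched in Python. The class `PieceOfEquipment`, its fields, and the helper `is_a` are assumptions made for illustration; they are not taken from any published model.

```python
from dataclasses import dataclass, field

@dataclass
class PieceOfEquipment:
    """Hypothetical record: one kind plus any number of categorizations."""
    tag: str
    kind: str                                   # e.g. "Pump", "Compressor"
    categorizations: set = field(default_factory=set)

def is_a(item, kind, *categories):
    """True if the item is of the given kind and carries every listed category."""
    return item.kind == kind and set(categories) <= item.categorizations

# A pump categorized twice; no named subtype is declared for the combination.
p1 = PieceOfEquipment("P1", "Pump", {"Vertical", "Centrifugal"})
c1 = PieceOfEquipment("C1", "Compressor", {"Vertical"})
```

With this shape, "Vertical Centrifugal Pump" is just the query `is_a(p1, "Pump", "Vertical", "Centrifugal")` rather than a separately defined type, which is the trade-off the following comments argue about.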

  9. Andries van Renssen posts · David, in order to make clear that subtyping is a very practical way, my discussion should be continued as follows: (your definition of pump is O.K.)
    The subtypes are defined by the characteristic values. Thus for example, the definition model of the concept 'centrifugal pump' is: a centrifugal pump is a subtype of pump that has an operating principle that is 'centrifugal'. So we use both, a characteristic and a subtype that is defined by the characteristic value 'centrifugal'.
    Once we have defined these subtypes, we need only one classification statement that states: P1 is classified as a 'horizontal centrifugal pump'.
    This will cause inheritance of all knowledge about horizontal pump as well as centrifugal pump, including the definition of horizontal and centrifugal (because that belongs to the definitions of the supertypes).

    However, if we had only characteristics, then the modeling of individual things would become very impractical, because our P1 would have to be classified as a pump AND have two aspects, AND one aspect would have to be qualified as 'horizontal', AND the other aspect would have to be qualified as 'centrifugal'. Then we still don't know what it implies to be classified as a 'horizontal centrifugal pump' (because that concept is undefined). And it's even worse, because we then still do not inherit any knowledge about a centrifugal pump, nor about a horizontal pump, but only about 'centrifugal' and 'horizontal'.
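
The inheritance argument above can be sketched as a walk up a taxonomy with multiple supertypes. The dictionaries `SUPERTYPES` and `FACTS`, the fact strings, and the function name are hypothetical stand-ins chosen for illustration, assuming the taxonomy is acyclic.

```python
# Hypothetical taxonomy: each concept lists its direct supertypes.
SUPERTYPES = {
    "pump": [],
    "horizontal pump": ["pump"],
    "centrifugal pump": ["pump"],
    "horizontal centrifugal pump": ["horizontal pump", "centrifugal pump"],
}

# Knowledge stated directly on each concept (illustrative strings only).
FACTS = {
    "pump": {"moves fluids"},
    "horizontal pump": {"orientation is horizontal"},
    "centrifugal pump": {"operating principle is centrifugal"},
    "horizontal centrifugal pump": set(),
}

def inherited_facts(concept):
    """Collect facts from the concept and, recursively, from all supertypes."""
    facts = set(FACTS[concept])
    for parent in SUPERTYPES[concept]:
        facts |= inherited_facts(parent)
    return facts

# A single classification statement for the individual thing P1.
classification = {"P1": "horizontal centrifugal pump"}
p1_facts = inherited_facts(classification["P1"])
```

One classification statement is enough here: `p1_facts` picks up what is known about both supertypes and about pump itself, even though nothing is stated directly on the combined subtype.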

  10. EVEREST RESPONDS:
    When thinking about subtypes and supertypes, I think in terms of populations. Each subtype and each supertype has a population. First there is a population of pumps, then there is a subtype population of horizontal pumps and another subtype population of centrifugal pumps. In this scenario, there are only four subtypes, not eight as Andries suggests. He falls into this trap by assuming that subtype populations must be exclusive. This would be a limitation of the modeling scheme or tool used. In general, subtype populations can always overlap, and if any two don't, we must define an exclusive constraint across them. Indeed, it would be possible to build a model with two intervening supertypes but that is artificial and unnecessary. An important modeling principle is that we try to model the world as directly as possible, with no spurious constructs (as with an intersection entity for an M:N relationship).

    .. To David, I would disagree that the sub/supertype construct is impractical. To me it is precisely the correct way to construct the model in this case. First a principle: every subtype population must be (potentially) a subset of each of its supertype populations (yes, I allow multiple supertypes). Secondly, if there is nothing special we want to do or represent about a subtype population, then why define it at all? The subtype represents a specialization of the supertype population based upon some criterion, which could be a single discriminator (characteristic variable?) or a complex Boolean expression involving multiple discriminators or properties. Perhaps thinking in terms of populations creates a simplification, but that is what we do in data modeling. Perhaps not when developing an ontology of concepts. I agree with Andries.

    .. To Tom: Aristotelian we may be, but as you say Wittgenstein states, the criterion may be either A or B or both, whereas A and B may overlap and subtypes may have multiple supertypes. As I said above, the discriminator can be a complex Boolean expression. I would disagree that the subtype/supertype "hierarchy" (which it isn't if you allow multiple supertypes) must be exclusive (on the subtypes) and exhaustive (on the supertype) at each level. It is perfectly acceptable to call out a subset of a supertype population where we want to do something special with the members of the subtype. In the world we are modeling, the general case is overlapping subtypes and non-exhaustive on the supertype. Exclusive and exhaustive would be declared as constraints on the more general case. These notions of sub/supertypes are not inconsistent with Codd; they just go beyond what he was defining in the relational modeling scheme.

  11. Tom Johnston posts (2012/10/19): Hi Andries. In brief: if the immediate subnodes for any node are not mutually exclusive, then things can be double counted (because double classified). If the immediate subnodes for any node are not jointly exhaustive, then things can go uncounted (because unclassified). So, in my opinion, the price for allowing exceptions is very high.

  12. Gordon Everest posts · Tom, I do not think the purpose of subtype/supertype constructs is to model a taxonomy, i.e., to set up a classification scheme in which every sibling level partitions the parent population into mutually exclusive and collectively exhaustive subsets. The purpose of a subtype is to call out a subset of the supertype population for special treatment, such as additional attributes or relationships. If there are no other members requiring special treatment, then there is no need to call them out (since the attributes of the subtype will be exactly the same as for the supertype), hence not all members of the supertype need be in a subtype. If a member of the supertype requires multiple, different special treatments (and that is important to the enterprise being modeled), then it needs to be in multiple subtype populations. Your assumption that subtype/supertypes must set up a classification scheme is, I believe, incorrect. One further consequence of your assumption is that you can never have multiple supertypes, and that is, in general, an unnecessary restriction. Model a classification scheme using subtype/supertype constructs, but don't say that it is the only way they can be used.

  13. David Hay posts · Gordon, I wish to disagree with the premise of your last post: The purpose of subtype/supertype constructs IS to model a taxonomy i.e., set up a classification scheme in which every sibling level partitions the parent population into mutually exclusive and collectively exhaustive subsets.
    Each entity type is a set of things (either tangible or abstract) that share a set of fundamental characteristics. They may also have other, "accidental" characteristics, but these are not part of the definition.

  14. Gordon Everest · (2012/10/19): Tom, I agree with David; your example is a special case. However, it can still be modeled in the more general case of the subtype/supertype construct. The general case is to allow non-exhaustive on a sibling set, and to allow overlapping subtypes. This is the least restrictive case and should be our default. If we want to apply some restrictions, which I call constraints (and so do many others), we can do so, but it need not be for all cases. There are three basic types of S/Stype constraints:
    (1) Dependency of the supertype on the subtypes, that is, each member of the supertype population must be in AT LEAST ONE of its subtypes, or we can say the supertype population is exhausted in this relationship.
    (2) Each member of the supertype population can be in AT MOST ONE of its subtypes, the exclusive constraint. (Note the similarity to the notions of dependency/optionality and multiplicity/exclusivity in characterizing relationships.)
    (3) The third basic constraint can be applied when we have multiple supertypes on a subtype. IF a member is in BOTH (SOME, ALL) of its supertype populations, then it MUST be in the common subtype population. Without the constraint, there could be members in multiple supertypes which are not in the common subtype. The effect of this constraint is that the subtype must be the full intersection of the populations of the supertypes.
    ... In your financial example, you wish to have each member of the supertype be in EXACTLY ONE of its subtypes, applying both constraints. That is fine for your example, but don't impose that rule on all subtype/supertype relationships. Remember why we are using this construct at all: to formally represent generalization and specialization in our populations of things. Generalization (finding commonalities) because it is more efficient in our thinking and in our representations. Specialization, because we want to call out a subset of some population for special treatment.
    .. Let me also add that these constraints can be intermixed, even on the same sibling set. For example, we could define an exhaustive constraint on just some of the subtypes for a supertype. Also, we could define an exclusive constraint across just some of the subtype populations. Try to represent that in something other than an Object Role Model (ORM or FOM)!
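
The three constraints listed in this comment can be written as checks over sets. This is a sketch only: the function names and the pump sample data are hypothetical, and check (3) tests equality with the intersection, which follows from combining the stated rule with the basic principle that a subtype is a subset of each of its supertypes.

```python
def is_exhaustive(supertype, subtypes):
    """Constraint (1): every supertype member is in AT LEAST ONE subtype."""
    return supertype <= set().union(*subtypes)

def is_exclusive(subtypes):
    """Constraint (2): no member appears in more than one of the subtypes."""
    seen = set()
    for population in subtypes:
        if seen & population:
            return False
        seen |= population
    return True

def is_full_intersection(subtype, supertypes):
    """Constraint (3): the subtype holds exactly the members that are in
    ALL of its supertypes (expects one or more Python sets)."""
    return subtype == set.intersection(*supertypes)

# Hypothetical populations for illustration.
pumps = {"P1", "P2", "P3", "P4"}
horizontal = {"P1", "P2"}
vertical = {"P3"}
centrifugal = {"P1", "P3"}
horizontal_centrifugal = {"P1"}

orientation_exclusive = is_exclusive([horizontal, vertical])
orientation_exhaustive = is_exhaustive(pumps, [horizontal, vertical])
mixed_exclusive = is_exclusive([horizontal, centrifugal])
intersection_ok = is_full_intersection(horizontal_centrifugal,
                                       [horizontal, centrifugal])
```

Because each check takes the list of subtypes it applies to, a constraint can be declared over just part of a sibling set: here exclusivity holds for the two orientation subtypes but not once `centrifugal` is included, and P4 is left in no orientation subtype, so exhaustiveness does not hold.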

  15. Gordon Everest · (2012/10/19)... I guess with the above discourse, I have responded to David's post as well. It argues why I disagree that the purpose of Sub/Super type IS to model a taxonomy or a classification scheme. That in itself would be a special case. Perhaps the difference is that I am always thinking in terms of populations of things rather than definitions of terms/concepts (and some have suggested that they are the same... I don't know!). Data modeling is fundamentally about replication. The reality is that we have individual instances of things in the world. The modeler observes commonalities and thus clusters individuals into populations of things, which, at some level, can have a common definition (e.g., a list of relevant attributes). That is the first step in generalization. In those populations, there is a replication of some template, pattern, or schema which applies across the members included. Replication in data is analogous to repetition in processes. The real benefit is that I can repeat the execution of a process over the replicated members of a population. That is one of the powers of computers.

    ... In my view, it is precisely those "accidental characteristics" that we want to capture and formally represent in our models (assuming they are of interest to those in the user environment). That is why we need the notion of subtype for special treatments.
    .. As for my being unduly influenced by OO (or did you mean ORM?), I have specifically distanced myself from OO. Their notions of subtypes (subclasses) and inheritance are quite different from those notions as used in data modeling. And as David points out, in pure OO there is no notion of a class (i.e., population); all objects (instances) are derived from (subtype, inheritance) some other object (instance). This difference is borne out in the notions of priorities and blocking in OO inheritance. In data modeling, if you are a member of a subtype, then you MUST be a member of the supertype(s) AND you must inherit ALL the properties of the supertype, with no picking and choosing as in OO. There is a lot of confusion on this point in the OO community. In OO the central focus is on efficiency of construction through reuse (inheritance), whereas in data modeling the focus is on modeling populations of things in the user's world, and with Sub/Supertypes, we can formally model overlapping populations of things (in contrast to Relational or ER, which always assumes that the entity populations are mutually exclusive).

  16. Tom posts: I'm saddened to hear from Gordon that some OO guys think an object can be a member of a subtype without being a member of its supertype (and recursively so). One week in a freshman class in set theory would disabuse anyone of that silly notion.

  17. EVEREST RESPONDS:
    Tom, I do understand set theory quite well. Let me clarify. I did not say "an object can be a member of a subtype without being a member of its supertype." That is impossible in any subtype/supertype relationship. It violates the fundamental definition. So we can agree that IF you are a member of a subtype, THEN you must be a member of all of your supertypes. What I did say was that, in the data modeling world, IF you are a member of a subtype population THEN you must also inherit ALL of the characteristics of the supertype(s), because in both sets, they are the exact same instance. If you have blue eyes in the subtype, you cannot have brown eyes in the supertype. That is not the case in the OO world. There, a member of the subtype/subclass can "block" the inheritance of an attribute, state, method, or relationship of the supertype. That is because OO is not based on the notion of populations, but on the notion of objects constructed from objects, not necessarily representing the same instance in the world being modeled. In OO the purpose is reuse through inheritance; in data modeling it is modeling overlapping populations in which an instance of something can be in more than one population, e.g., a person (instance) can be an employee, a shareholder and a customer at the same time. In our database we most likely have separate tables for those three subtypes because they have widely differing attributes and it would be inefficient to put them all into the same record type. However, we do note that there is some redundancy across the subtypes: person name, address, phone, height, weight, and color of eyes. Generalization is to take out the common attributes and put them into a supertype, from which each of the subtypes can inherit these common characteristics. The underlying assumption in all ER and relational modeling is that all entity populations are mutually exclusive. This is a fundamental assumption always made. However, we know that this is not always true. In this example, a person can be two or all three of the subtypes. To model this correctly requires the use of subtype/supertype constructs. Furthermore, it need not be that all sibling sets must be mutually exclusive and collectively exhaustive. When it is true, apply the constraints.
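
The contrast drawn here can be sketched side by side. This is one possible illustration under stated assumptions: Python method overriding stands in for the OO "blocking" described above, and the two dictionaries stand in for supertype and subtype tables; all names and values are hypothetical.

```python
# OO style: a subclass may replace what it would otherwise inherit.
class Person:
    def eye_color(self):
        return "brown"

class Employee(Person):
    def eye_color(self):          # overrides the inherited definition
        return "blue"

# Population style: the supertype row and the subtype row describe the SAME
# instance (same key), so a common attribute is stored once on the supertype.
person_table = {"p7": {"name": "Pat", "eye_color": "blue"}}
employee_table = {"p7": {"salary": 50_000}}

def employee_view(key):
    """An employee seen with all inherited supertype attributes attached."""
    return {**person_table[key], **employee_table[key]}
```

In the class version the subclass can answer differently from its parent; in the table version `employee_view("p7")` has no way to report an eye color different from the one recorded for the person, because there is only one instance and one stored value.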

  18. EVEREST further responds:
    Tom, now to another point. You state that
    "(2) If a finite set of nodes is acyclically connected by one to many relationships, it is a hierarchy, having a first member (called the root node) and one or more last members (called leaf nodes)."

    This is another common misunderstanding in set theory or the more general graph theory. What you are forgetting is that the nodes are HOMOGENEOUS in the set (at least at some level). In graph theory the often unstated assumption is that the nodes are homogeneous, at least in the sense that they are identified in the same name space. What you have in (3) is the most general case of a graph (or network?). What you have defined in (2) is a tree, not a hierarchy. A tree is a rooted, connected, acyclic graph of N (finite) nodes and N-1 arcs. A hierarchy, on the other hand, can draw upon nodes from two different populations, and there is a 1-M (hierarchical) relationship between members of the "parent" population and the "dependent" population. As an example, I can have organizational units which can have multiple employees, and an employee must belong to at least one (dependent) and at most one (exclusive), i.e., exactly one organizational unit. But we would never say that employees and organizations are from the same population, hence they cannot be described as a graph. (The exception might be a reflexive hierarchical relationship, but this is still binary and hierarchical because the members at each end of an instance of the relationship are playing different roles. An example might be "boss of" in a hierarchical organization. All the nodes are from the same population, employees, but they participate in the relationship in different roles. In two different instances of the relationship, an individual can be a boss in one and a subordinate in the other.)
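
The two notions distinguished in this comment can be sketched as two separate checks. The function names and sample data are hypothetical. The tree check relies on a standard fact: a graph with N nodes and exactly N-1 arcs that is connected has no cycles. It ignores which node is the root and assumes every arc endpoint is listed in `nodes`.

```python
def is_tree(nodes, arcs):
    """True if the graph (one homogeneous node set) is connected and has
    exactly N-1 arcs, which together imply that it is acyclic."""
    nodes = set(nodes)
    if not nodes or len(arcs) != len(nodes) - 1:
        return False
    neighbours = {n: set() for n in nodes}
    for a, b in arcs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:                              # depth-first reachability
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == nodes

def is_hierarchical(assignment, parents, dependents):
    """True if every dependent is assigned exactly one parent, where parents
    and dependents may come from two different populations. A dict already
    allows at most one parent per dependent."""
    return (set(assignment) == set(dependents)
            and set(assignment.values()) <= set(parents))

# Two populations: organizational units and employees (illustrative data).
org_units = {"sales", "research"}
employees = {"ann", "bob", "cho"}
works_in = {"ann": "sales", "bob": "sales", "cho": "research"}

# Reflexive case: both ends drawn from the employee population.
boss_of = {"bob": "ann", "cho": "ann"}        # subordinate -> boss
```

`works_in` passes the hierarchy check across two populations, while `boss_of` can additionally be viewed as a tree over the single employee population, with its arcs given by the subordinate/boss pairs.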


Comments to any post are always welcome. I thrive on challenges and it will be more interesting for you.