Intro

I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students; some are my responses to posts on LinkedIn forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to geverest@umn.edu, since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to add you.

2018-05-15

Data is like the corn flakes

In our information systems and IT shops, the computers, communication networks, programs, systems, programmers -- indeed, all IT/IS personnel -- are merely "factors of production." We don't "eat" the factors of production of corn flakes: the bins which store the corn and other ingredients; the machines which roll out the corn, mix the ingredients, and fill the boxes; the belts and pipes that move the product in the warehouse; even the warehouse itself. No, we eat the corn flakes, just as we "eat" the data. It is the output from our computer information systems that is of value.

2018-05-14

Data must be managed as an asset

by Martha Lemoine, on the Data-Centric Manifesto.

Everest responds:

Yes, we often say this, but what does it really mean? Look at the definition of an asset in accounting. Organizations spend money; we call these expenditures. The general assumption is that an expenditure benefits the current accounting period (year), because it is basically helping to generate revenue in the same period in which it is spent, so we then call it an expense. However, if it is a relatively large expenditure which benefits future years, say a new building or piece of machinery, then accountants want to capitalize the expenditure and spread it as an expense (depreciation) over multiple years, offsetting the future revenue it helps to generate. A capitalized expenditure is called an asset. Capitalizing an expenditure pushes expense into the future, thus increasing reported income for the current accounting period or year.
.. Now in the case of data: does it benefit future time periods, i.e., help to generate revenue in the future? Of course. So what expenditures are made to collect, store, process, retrieve, maintain, and update it? Should they be capitalized? That is the hard part. Nobody does, except maybe a firm whose business is selling data, and even then most of the time they don't. So we must be careful calling data an "asset." Nevertheless, we can certainly call it "something of value" which, as you say, must be managed as an asset.
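.. To make the accounting mechanics concrete, here is a minimal sketch in Python (the figures are invented purely for illustration) of how capitalizing an expenditure spreads the expense, via straight-line depreciation, over the future periods it benefits:

    # Straight-line depreciation: a capitalized expenditure is expensed
    # gradually over the periods it benefits, instead of all at once.
    def straight_line_depreciation(cost, salvage_value, useful_life_years):
        """Annual depreciation expense for a capitalized asset."""
        return (cost - salvage_value) / useful_life_years

    # Hypothetical figures: a $500,000 expenditure benefiting 5 years.
    cost, salvage, life = 500_000, 0, 5
    annual_expense = straight_line_depreciation(cost, salvage, life)

    # Expensed immediately: year 1 bears the full $500,000.
    # Capitalized: each of years 1-5 bears only $100,000,
    # raising reported income in the current period.
    for year in range(1, life + 1):
        print(f"Year {year}: depreciation expense = ${annual_expense:,.0f}")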

2018-03-12

Data Modeling and NoSQL


To the question “How relevant is data modeling in the world of NoSQL?” I give the following answer.
  The main purpose of data modeling is to understand the business, some application domain, some user world. The model becomes a representation of that world -- the "things" in it, the relationships among those things, and any constraints on those things or relationships. A secondary purpose is to build a database to contain information which pertains to and describes that domain. Hence, the business data model should not be concerned with issues of physical stored representation, or with the transformations/manipulations/constraints which are imposed to facilitate implementation in some data (storage) management system. That could be a relational DBMS, or a NoSQL tool. Generally we speak of the model coming first, then the implementation, and finally the data getting collected and stored according to the model. However, increasingly the data already exists in some form, which leaves us with the task of figuring out what it means, what it represents -- that is, understanding the data as it represents some user domain. NoSQL tools are often designed to deal with existing data and to process it more efficiently (that may be an oversimplification!). Either way, you must understand the business in order to make sense of the data.
BOTTOM LINE:  it is vitally important to understand the principles, methods, and approaches of data modeling which can help you get to an understanding of the data and the business domain it represents.  Whether using traditional database management tools or NoSQL tools you still need to understand the data.
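As a small illustration (a hypothetical document with invented field names), consider a record in a document store. Nothing in the store declares its schema; the model is implicit, and recovering it is exactly the data modeling task:

    import json

    # A document as it might sit in a document store; no declared schema.
    raw = '''{"cust_id": "C-1042", "name": "Acme Ltd",
              "orders": [{"order_no": 17, "total": 250.0},
                         {"order_no": 18, "total": 99.5}]}'''

    doc = json.loads(raw)

    # The "schema" lives in the reader's head: a Customer places Orders
    # (a one-to-many relationship), order_no identifies an order within
    # a customer, and total is presumably non-negative. None of this is
    # enforced by the store; the data model is implicit, not absent.
    assert all(o["total"] >= 0 for o in doc["orders"])
    print(doc["name"], "has", len(doc["orders"]), "orders")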

2018-01-02

Data modeling course in a nutshell

A former student asks:

..    I took your class last spring and am currently in talks with some companies for a junior database position. I find it a little challenging when it comes to talking about exactly what I learned in your class, as we covered so much valuable information. If you have any input for a quick way to describe all of the information I learned, that would be very helpful.

Concerning Data Model Madness: what are we talking about?

Martijn posts (2012/10/11)

..   There have been endless debates on how to name, identify, relate and transform the various kinds of data models we can, should, would, or must have in the process of designing, developing and managing information systems (like data warehouses). We talk about conceptual, logical and physical data models, usually in the context of certain tools, platforms, frameworks or methodologies. A confusion of tongues is usually the end result. Recently David Hay made an interesting video (Kinds of Data Models ‑‑ And How to Name Them) in which he tries to resolve this issue in a consistent and complete manner. But on LinkedIn it was already questioned whether this is a final or universal way of looking at such models.
..   One of the caveats is that a 'model' needs operators as well as data containers. Only the relational model (and some fact-oriented modeling techniques, which support conceptual queries) defines operators in a consistent manner. In this respect, Entity Relationship techniques are formally just diagramming techniques. Disregarding this distinction for now, we need to ask ourselves if there are universal rules for kinds of data models and their place within a model generation/transformation strategy.
..   I note that if you use e.g. FCO‑IM diagrams you can go directly from a conceptual to a "physical" DBMS schema if you want to, even to special ones like Data Vault or Dimensional Modeling. I also want to remark that there are 'formal language' modeling techniques like Gellish that defy David's classification scheme. They are both ontological and fact-driven and could in theory go from ontological to physical in one step, without a conceptual or logical layer (while still being consistent and complete, by the way, so no special assumptions except a transformation strategy). The question arises how many data model layers we want or need, and what each layer should solve for us. There is tension between minimizing the number of layers and not overloading a layer/model with too much semantics and too many constructs, which hampers its usability.
..   For me this is governed by the concept of concerns: pieces of interest that we try to describe with a specific kind of data model. These can be linguistic concerns like verbalization and translation; semantic concerns like identification, definition and ontology; data quality concerns like uniqueness constraints; or implementation and optimization concerns like physical model transformation (e.g. Data Vault, Dimensional Modeling), partitioning and indexing. Modeling/diagramming techniques and methods usually have primary concerns that they aim to solve in a consistent way, and secondary concerns they can model but do not deem important (though these are usually important concerns at another level/layer!). What makes this even more difficult is that within certain kinds of data models there is also tension between notation, approach and theory (the N.A.T. principle). E.g., the relational model is theoretically sound, but the formal notation of nested sets isn't visually helpful; ER diagramming looks good, but there is little theoretical foundation beneath it.
..   I personally think we should try to rationalize the use of data model layers, driven by concerns, instead of standardizing on a basic three-level approach of conceptual, logical, and physical. We should be explicit about the concerns we want to tackle in each layer instead of using generic, suggestive layer names.
I would propose the following (data) model layers minimization rule:

A layered (data) modeling scenario supports the concept of separation of concerns (as defined by Dijkstra) in a suitable way with a minimum of layers using a minimum of modeling methodologies and notations.

Physical data model governed by business (logical) data model

Jaunine Conradie (student) asks: (email to me on 2012 Aug 13)
    The typical Teradata architecture includes a staging layer (staging data from the sources in the form in which it exists on the source), an integration layer (3NF integrated data from stage), and a semantic layer (a dimensional view on the integration layer data).
.. We are two modelers on the project. The problem we have is that the ETL lead is demanding a model change to facilitate data loads and queries.
.. In short, a Subscriber on the network (Vodacom here in South Africa, which is like Sprint or Verizon in the USA) has various Access Methods of various Access Method Types. Types can be, for example, Primary (cell phone number), Data (for internet access), etc.
The change being requested by the ETL lead is to host the Primary access method on Subscriber.

This is problematic because:
1. This is inconsistent with the implementation, where some access methods for a subscriber will now be carried on the Access Method Subscriber entity, and some on the Subscriber entity itself.
2. We do not (and should not) carry history on the Subscriber entity.
3. Teradata has specialized index types to facilitate joins (thus negating the request to denormalize for query performance).


How do we convince the program manager to not let ETL‑related considerations dictate the design of the physical model?
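For context, here is a minimal sketch in Python (entity and attribute names invented for illustration) of the normalized design being defended: every access method, the Primary one included, is a row on the Access Method Subscriber entity, so history can be kept uniformly for all of them:

    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    # Normalized design: Subscriber carries no access methods itself.
    @dataclass
    class Subscriber:
        subscriber_id: int
        name: str

    # Every access method, Primary included, is a row here, so history
    # (effective dates) can be kept uniformly for all of them.
    @dataclass
    class AccessMethodSubscriber:
        subscriber_id: int          # FK to Subscriber
        access_method: str          # e.g. a cell phone number
        access_method_type: str     # "Primary", "Data", ...
        effective_from: date
        effective_to: Optional[date]  # None = current row

    sub = Subscriber(1, "J. Smith")
    primary = AccessMethodSubscriber(1, "+27 82 555 0100", "Primary",
                                     date(2012, 1, 1), None)
    data = AccessMethodSubscriber(1, "APN:internet", "Data",
                                  date(2012, 3, 1), None)

    # The ETL lead's request would move the Primary access method onto
    # Subscriber as a column -- splitting one fact type across two
    # entities and losing history for the Primary method.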

Definitions of concepts should start with their supertype. Why not?

Andries van Renssen posts on LinkedIn.
A good definition of a concept should build on the definition of more generalized concepts. This can be done by two parts of a definition:
1. Specifying that the concept is a subtype of some other more generalized concept.
2. Specifying in what respect the concept deviates from other subtypes of the same supertype concept.

Such definitions have great advantages, such as:
‑ The definition of (and the knowledge about) the supertype concept does not need to be repeated, as it is also applicable ("by inheritance") for the defined (subtype) concept.
‑ The definition implies a taxonomy structure that helps to find related concepts that are also subtypes of the same supertype (the 'sister‑concepts'). Comparison with the 'sister‑concepts' greatly helps to refine the definitions. Furthermore, the definitions prepare for creating an explicit taxonomy.
‑ Finding the criteria that distinguish the concept from its 'sister concepts' prepares for explicit modeling of the definitions, which in turn prepares for computerized interpretation and growth towards a common Formal Language.
--Thus, why not?
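A small sketch (with an invented toy taxonomy) of the supertype-plus-distinguishing-criteria pattern being proposed: each subtype's definition names its supertype and states only what distinguishes it from its sister concepts:

    # Each concept is defined by (1) its supertype and (2) the criteria
    # that distinguish it from other subtypes of that supertype.
    class Vehicle:
        """A device for transporting people or goods."""

    class Bicycle(Vehicle):
        """A Vehicle distinguished by having two wheels and being
        human-powered."""

    class Car(Vehicle):
        """A Vehicle distinguished by having four wheels and an engine."""

    # The class hierarchy makes the implied taxonomy explicit, and
    # everything said about Vehicle holds for Bicycle and Car by
    # inheritance -- nothing about 'transporting' needs restating.
    print(Bicycle.__mro__)           # Bicycle -> Vehicle -> object
    print(issubclass(Car, Vehicle))  # True: Car is a 'sister' of Bicycle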

Can we argue more about business rules?


Posted by Kamran Ayub - Thursday, May 6, 2010, 06:24 PM

I think it's interesting; it's too bad we waited so long to argue about it.  Let me say first off: I understand both arguments. I have my own opinions, but I can definitely see both sides to this.  First, I think we should establish some FACTS that we can all agree on:
1. No DBMS currently available (at least for production-oriented apps) can completely express every business rule we need to express innately. (Source: experience and Everest)
2. No matter which way you slice it, business rules eventually get turned into code somewhere, whether in the DBMS or in application logic. At some point, there's a piece of machine language where the rule is expressed and validated in 1s and 0s.


There, now that we agree on those two points, what exactly are we arguing about?  If you believe Fact 1, then it follows that it is not possible to ever express all our business rules in the DBMS (at this current state in time).
Here's my view: all DBMSs do enforce a limited set of business rules. Thus, if you argue that your application code holds all business rules in one central place, you are wrong, because the DBMS enforces a few of those constraints (mandatoriness, uniqueness, etc.). Most of us end up duplicating these rules in application code, without thinking about it, to avoid having to catch exceptions.

Secondly, you cannot guarantee that there is only a single point of entry to your data. Don't argue with me; there is a sysadmin account somewhere on the system where the data is stored, and at least that account has full authority to change the data. How many of us go directly into our SQL databases to fix a data error, insert sample data, or test triggers? Exactly. Besides, let's not forget you had to define that structure somehow! It may be true your business users can't access that data directly, but somebody is the data steward, and she has full control, rules or no rules. And what about unauthorized access? Whoops, guess that hacker didn't care about all your application-enforced business rules! It's fine to argue that the hacker could completely destroy your system anyway, but you must agree that if your logic is stored in the application, then at some point, due to the nature of TCP/IP, your data is out in the open between your app and your database. It could potentially be tampered with there, and your application can't do anything about it.
So now, the argument is not whether it's better to store business rules in application logic, the question is why can't we put those rules in the DBMS itself, innately? The answer to that is simply because no DBMS allows that just yet.  The reason I think people argue for storing rules in the application is simply because there's really no other better alternative right now. If a .NET developer is comfortable with .NET and not with T-SQL, where do you think she'll put her business rules? In the .NET application or some other application gateway (not the DBMS). This is what I heard today in our discussion. We put all our rules in application logic because it's the best central repository we have... when really the best place should be the DBMS. After all, what have we been learning all along in this class? ORM is able to capture many of these rules... and yet, not all. Business rules are there to establish integrity with our data so they should be in the DBMS.
I believe it's useless to argue for or against storing rules in application logic because right now it's impossible to put them all in the DBMS. Even if you wrote stored procs or triggers for everything, that's not enforced by the DBMS innately (triggers are closer to automatic), plus the rules are still in code. Furthermore, I believe most people put their rules in application code because it's just harder to do it in the DBMS, and because many enterprise or even regular applications must target different storage systems (one application could use XML, MSSQL, MySQL, DB2, or MS Access as its repository). Imagine duplicating all your business rules across those systems... it's much easier to store the rules in the code, isn't it? Let me be clear: I am not advocating that it is right to store business rules in app logic. I also believe that it is best to put business rules in the DBMS. All I am saying is, what choice do we have at the moment? Until a better DBMS is created, preferably one using ORM as the tool to create the DB, we're SOL. Let the discussion begin!
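To make the duplication point concrete, here is a minimal sketch (using Python and SQLite purely for illustration, with invented table and column names): the DBMS declares the rules it can enforce innately, and the application re-implements the very same checks to give friendly messages instead of catching database exceptions:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Rules the DBMS can enforce innately: mandatoriness and uniqueness.
    conn.execute("""
        CREATE TABLE subscriber (
            phone  TEXT NOT NULL UNIQUE,
            name   TEXT NOT NULL
        )""")

    # The same rules, duplicated in application code.
    def validate(phone, name):
        errors = []
        if not phone:
            errors.append("phone is mandatory")
        if not name:
            errors.append("name is mandatory")
        if conn.execute("SELECT 1 FROM subscriber WHERE phone = ?",
                        (phone,)).fetchone():
            errors.append("phone must be unique")
        return errors

    print(validate("", ""))        # caught first in the app...
    try:
        conn.execute("INSERT INTO subscriber VALUES ('', NULL)")
    except sqlite3.IntegrityError as e:
        print(e)                   # ...and enforced again by the DBMS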

EVEREST RESPONDS:

.. I love it.  I agree with all you have said. I just would not characterize it as "two sides of an argument." Both "sides" must spring from an environment of understanding of the basic concepts. There are two main traps we fall into: 

1. Believing that someday we can reach the ideal state of a perfect world. Your discourse is filled with those thoughts, e.g., "no DBMS currently available..." (implying that one may be in the future), "at this current state in time" (suggesting that there may come a time), "no DBMS allows that just yet" (but perhaps a future DBMS product might?), "right now it is impossible" (with the implication that in the future it will be), "until a better DBMS is created" (better yes, but never perfect or complete). "Man" (which includes woman) is imperfect, and therefore so are his constructions: our DBMSs, our databases, our code, our systems. So try as we might to improve (and we should), we must always be alert to imperfection, weaknesses, failures, and attacks (whether inadvertent or deliberate) and build in countermeasures. That is why we back up our data, build in redundancy, and why every government (every locus of power) must have "her majesty's loyal opposition" and the watchful eye of citizens and the media.

2. We tend to use, to favor that which is more familiar to us. If we are a programmer and that is what we know how to do, we will get things done by writing code in our favorite programming language. If we know how to use the full functionality of some DBMS, then that is what we will do as much as we can.  Some application coders need to be stretched to consider an alternative way of doing things that might get us closer to the ideal of serving people (management, organizations) AND incorporating appropriate safeguards (which also will never be perfect). 

.. The business rules and integrity constraints are those mechanisms that can lead to higher quality data, greater user confidence (in our databases, our systems, our IT departments, and our organizations), better decisions, and more effective and efficient operation. As you rightly state, ALL that happens in a computer system is done by executing code. But that is only half the story.  ALL that executing code is driven by specifications, definitions, declarations, input parameters, in other words "data" more broadly interpreted. Even the program code is the specification of a process. The accuracy and completeness of the process depends on the specifications, the directives, etc. 

.. As I show in my lecture on Business Rules (slides BRules.23 and .27), there is a hierarchy of levels where this information, the business rules, can be expressed:
1. DBMS
2. Stored Procedures, whose execution is under the control of the DBMS.
3. Code (methods) embedded (encapsulated) in "objects" which execute in a well-managed and controlled environment (OO, SOA, SaaS...)
4. User application code
5. External user actions (manual or sysadmin)
.. At all these levels the processes are driven by definitions, specifications.  The efficacy of those processes depends on those specifications, and we (humans) understand those processes through documentation which itself is usually quite imperfect.

.. Moving down these levels we encounter decreasing control and increasing dispersion. I put the DBMS at the top because it is the closest to the data and the most likely centralized point of control (least dispersion). Our objective is (or should be) to move these boundaries down so that work gets pushed up into more controlled environments. Another thing that happens as we move up through these levels is increasing abstraction. When we declare database integrity constraints, that is a specification (an abstraction) of some underlying process (executed by code... somewhere). We (humans) do not need to know what goes on under the covers; we merely need to have confidence that it will be done correctly and consistently, as reflected in the definition or specification. For example, we may understand the meaning of the square root operator, but how it gets executed on a computer is quite complex; I want to be shielded from that complexity. When we write a program we call all sorts of subroutines or functions, which are defined at a pretty high level (of abstraction).
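.. To illustrate that point (a contrived sketch, not any particular product): a declared constraint is a specification, pure "data," and some generic engine under the covers turns it into executing code. The rule names and format here are invented:

    # Business rules declared as specifications (data), not as code.
    rules = {
        "phone": {"required": True, "unique": True},
        "age":   {"required": False, "min": 0, "max": 130},
    }

    def check(record, existing, rules):
        """Generic enforcement engine: the 'code somewhere' that a
        declarative constraint abstracts away from the modeler."""
        for field, spec in rules.items():
            value = record.get(field)
            if spec.get("required") and value is None:
                yield f"{field} is mandatory"
            if value is None:
                continue
            if spec.get("unique") and any(r.get(field) == value
                                          for r in existing):
                yield f"{field} must be unique"
            if "min" in spec and value < spec["min"]:
                yield f"{field} below minimum"
            if "max" in spec and value > spec["max"]:
                yield f"{field} above maximum"

    db = [{"phone": "555-0100", "age": 44}]
    print(list(check({"phone": "555-0100", "age": -1}, db, rules)))
    # ['phone must be unique', 'age below minimum']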

.. To the comment of "too bad we waited so long": The focus of my course is data modeling. The subject of a companion course would be DBMS and Database Administration. This issue we are discussing would properly belong in the second course.  However, I raise the question here because the success of maintaining the integrity of our databases according to our defined semantics depends upon the execution environment, the DBMS ++.

.. Let the discussion continue.