Intro
I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students; some are my responses to posts on LinkedIn forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to geverest@umn.edu, since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to do that.
2018-05-15
Data is like the corn flakes
In our information systems and IT shops, the computers, communication networks, programs, systems, programmers, indeed all IT/IS personnel, are merely "factors of production." We don't "eat" the factors of production of corn flakes - the bins which store the corn and other ingredients, the machines which roll out the corn, mix the ingredients, and fill the boxes, the belts and pipes that move the product in the warehouse, and even the warehouse itself. No, we eat the corn flakes, just as we "eat" the data. It is the output from our computer information systems which is of value.
2018-05-14
Data must be managed as an asset
by Martha Lemoine, on the Data-centric manifesto.
Everest responds:
Yes, we often say this, but what does it really mean? Look at the definition of an asset in accounting. Organizations spend money; we call these expenditures. The general assumption is that an expenditure benefits the current accounting period (year) because it is basically helping to generate revenue for the same period in which it is spent, so we then call it an expense. However, if it is a relatively large expenditure which benefits future years, say a new building or piece of machinery, then accountants want to capitalize the expenditure and spread it as an expense (depreciation) over multiple years, offsetting the future revenue it helps to generate. A capitalized expenditure is called an asset. Capitalizing an expenditure pushes expense into the future, thus increasing reported income for the current accounting period or year.
.. Now in the case of data, does it benefit future time periods, i.e., help to generate revenue in the future? Of course. So what about the expenditures made to collect, store, process, retrieve, maintain, update, etc.? Should they be capitalized? That is the hard part. Nobody does, except maybe a firm whose business is selling data, and even then most of the time they don't. So we must be careful calling data an "asset." Nevertheless, we can certainly call it "something of value" which, as you say, must be managed as an asset.
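To make the capitalization arithmetic concrete, here is a minimal sketch (an illustration of my own with made-up figures, assuming simple straight-line depreciation): expensing a large expenditure immediately hits this year's income in full, while capitalizing it spreads the same amount over the years it benefits.

```python
# Illustration only: straight-line depreciation with made-up figures.
expenditure = 500_000        # a large expenditure, e.g., a building or machine
useful_life_years = 5        # assumed number of future years it benefits

# Expensed immediately: the whole amount reduces income in year 1.
expense_if_expensed = [expenditure] + [0] * (useful_life_years - 1)

# Capitalized as an asset: the expense is spread (depreciation) over its useful
# life, offsetting the future revenue it helps to generate.
annual_depreciation = expenditure / useful_life_years
expense_if_capitalized = [annual_depreciation] * useful_life_years

for year in range(useful_life_years):
    print(f"Year {year + 1}: expensed={expense_if_expensed[year]:>9,.0f}  "
          f"capitalized={expense_if_capitalized[year]:>9,.0f}")
```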
2018-03-12
Data Modeling and NoSQL
To
the question “How relevant is data
modeling in the world of NoSQL?” I give the following answer.
The main purpose of data modeling is to
understand the business, some application domain, some user world. The model becomes a representation of that
world -- the "things" in it, the relationships among those things and
any constraints on those things or relationships. A secondary purpose is to build a database to
contain information which pertains to and describes that domain. Hence, the business data model should not be
concerned with issues of physical stored representation, or the
transformations/manipulations/constraints which are imposed to facilitate
implementation in some data (storage) management system. That could be a relational DBMS, or a NoSQL
tool. Generally we speak of the model
coming first, then the implementation, and finally, the data gets collected and
stored according to the model. However,
increasingly the data already exists in some form, which leaves us with the task of figuring out
what it means, what it represents -- that is, understanding the data as it
represents some user domain. NoSQL tools
are often designed to deal with existing data and to process it more
efficiently (that may be an oversimplification!). Either way, you must understand the business
in order to make sense of the data.
BOTTOM
LINE: it is vitally important to
understand the principles, methods, and approaches of data modeling which can
help you get to an understanding of the data and the business domain it
represents. Whether using traditional database management tools or NoSQL tools, you still need to understand the data.
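As a small illustration of that bottom line (my own sketch, not tied to any particular product; the names are hypothetical), the same business fact type -- a Customer places Orders -- can be understood once at the business level and then given quite different stored representations, relational or document-style:

```python
import json
import sqlite3

# One business fact type: "Customer (name) places Order (number) on a date."
# Below are two possible physical representations of that same model.

# Relational representation: two normalized tables joined by a foreign key.
relational_ddl = """
CREATE TABLE Customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE "Order" (
    order_no    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES Customer(customer_id),
    order_date  TEXT NOT NULL
);
"""
sqlite3.connect(":memory:").executescript(relational_ddl)

# Document (NoSQL-style) representation: orders nested inside the customer.
customer_document = {
    "customer_id": 17,
    "name": "Acme Ltd",
    "orders": [
        {"order_no": 1001, "order_date": "2018-03-01"},
        {"order_no": 1002, "order_date": "2018-03-09"},
    ],
}
print(json.dumps(customer_document, indent=2))
```

Either representation can only be judged against the business model it is supposed to implement, which is why the modeling matters even when the data already exists.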
2018-01-02
Data modeling course in a nutshell
A former student asks:
.. I took your class last
spring and am currently in talks with some companies for a junior database
position. I find it a little challenging
when it comes to talking about exactly what I learned in your class as we
covered a lot of valuable information. If you have any input for a quick
way to describe all of the information I learned that would be very helpful.
Concerning Data Model Madness: what are we talking about?
Martijn posts
(2012/10/11)
.. There have been endless
debates on how to name, identify, relate and transform the various kinds of
Data Models we can, should, would, must have in the process of designing,
developing and managing information systems (like Data warehouses). We talk
about conceptual, logical and physical data models, usually in the context of
certain tools, platforms, frameworks or methodologies. A confusion of tongues
is usually the end result. Recently David Hay made an interesting video (Kinds of Data Models ‑‑ And How to Name them) in which he tries to resolve this issue in a consistent and complete manner. But on LinkedIn it was already questioned whether this is a final or universal way of looking at such models.
.. I note that if you use e.g.
FCO‑IM diagrams you can go directly from conceptual to "physical"
DBMS schema if you want to, even special ones like Data Vault or Dimensional
Modeling. I also want to remark that there are 'formal language' modeling
techniques like Gellish that defy David's classification scheme. They are both
ontological and fact driven and could in theory go from ontological to physical
in one step without a conceptual or logical layer (while still being consistent and complete, by the way, so no special assumptions except a transformation strategy). The question arises of how many data model layers we want or need, and what each layer should solve for us. There is a tension between minimizing the number of layers and, at the same time, not overloading a layer/model with so much semantics and so many constructs that its usability is hampered.
.. For me this is governed by
the concept of concerns, pieces of interest that we try to describe with a
specific kind of data model. These can be linguistic concerns like
verbalization and translation, semantic concerns like identification,
definition and ontology, data quality concerns like uniqueness, or implementation and optimization concerns like physical model transformation (e.g. Data Vault, Dimensional Modeling), partitioning and indexing. Modeling/diagramming techniques and methods usually have primary concerns that they want to solve in a consistent way, and secondary concerns they can model but do not deem important (though these are usually important concerns at another level/layer!). What makes this even more difficult is that within certain kinds of data models there is also a tension between notation, approach and theory (the N.A.T. principle). E.g., the relational model is theoretically sound, but the formal notation of nested sets isn't visually helpful; ER diagramming looks good, but there is little theoretical foundation beneath it.
.. I personally think we should
try to rationalize the use of data model layers, driven by concerns, instead of
standardizing on a basic 3-level approach of conceptual, logical, and physical. We should be explicit about the concerns we want to tackle in each layer instead
of using generic suggestive layer names.
I would propose the following (data) model layers minimization rule:
A layered (data) modeling scenario supports the concept of separation
of concerns (as defined by Dijkstra) in a suitable way with a minimum of layers
using a minimum of modeling methodologies and notations.
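To make the layer names concrete, here is a minimal sketch of my own (not Martijn's or David Hay's; the entity and column names are invented) showing one fact type carried through the three commonly named layers:

```python
import sqlite3

# Conceptual: a verbalized business fact type, independent of any DBMS.
conceptual = "Employee (employee number) works in Department (department code)."

# Logical: a relational design with keys, still independent of any one product.
logical = {
    "Employee":   ["emp_no (PK)", "emp_name", "dept_code (FK -> Department)"],
    "Department": ["dept_code (PK)", "dept_name"],
}

# Physical: DDL for one target platform, with choices (data types, an index)
# that belong only at this layer.
physical_ddl = """
CREATE TABLE Department (
    dept_code CHAR(4) PRIMARY KEY,
    dept_name VARCHAR(60) NOT NULL
);
CREATE TABLE Employee (
    emp_no    INTEGER PRIMARY KEY,
    emp_name  VARCHAR(60) NOT NULL,
    dept_code CHAR(4) NOT NULL REFERENCES Department(dept_code)
);
CREATE INDEX employee_dept_ix ON Employee(dept_code);
"""
sqlite3.connect(":memory:").executescript(physical_ddl)
print(conceptual)
```

Each layer answers a different concern -- verbalization at the top, keys and normalization in the middle, data types and indexing at the bottom -- which is the kind of separation of concerns the rule above asks each layer to make explicit.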
Physical data model governed by business (logical) data model
Jaunine Conradie (student) asks: (email to me on 2012 Aug 13)
The typical Teradata
architecture includes a staging layer (stages data from sources in the form in which it exists on the source), an integration layer (3NF integrated data from
stage), and a semantic layer (dimensional view on the integration layer data).
.. We are two modelers on the project. The problem that we have is that
the ETL lead is demanding a model change to facilitate data loads and queries.
.. In short, a Subscriber on the network (Vodacom here in South Africa,
which is like Sprint or Verizon in the USA) has various Access Methods of various Access Method Types. Types can, for example, be Primary (cell phone number), Data (for internet access), etc.
The change being requested by the ETL lead is to host the Primary
access method on Subscriber.
This is problematic because:
1. This is inconsistent with the implementation, where some access
methods for a subscriber will now be carried on the Access Method Subscriber
entity, and some on the Subscriber entity itself.
2. We do not (and should not) carry history on the Subscriber entity.
3. Teradata has specialized index types to facilitate joins (negating the argument for denormalizing on grounds of query performance).
How do we convince the program manager to not let ETL‑related
considerations dictate the design of the physical model?
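To make the disputed change concrete, here is a minimal sketch of the two designs as I read the description above (table and column names are my own and hypothetical): the modelers' design keeps every access method, of every type, on one entity, while the ETL lead's change would carry the Primary access method as a column on Subscriber, so the same kind of fact would live in two places.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

# Design as modeled: every access method, of every type, is a row in one place,
# and history is carried here rather than on Subscriber.
con.executescript("""
CREATE TABLE Subscriber (
    subscriber_id INTEGER PRIMARY KEY,
    full_name     TEXT NOT NULL
);
CREATE TABLE AccessMethodType (
    type_code TEXT PRIMARY KEY          -- e.g., 'PRIMARY', 'DATA'
);
CREATE TABLE AccessMethod (
    subscriber_id INTEGER NOT NULL REFERENCES Subscriber(subscriber_id),
    type_code     TEXT    NOT NULL REFERENCES AccessMethodType(type_code),
    access_number TEXT    NOT NULL,     -- the number or address itself
    valid_from    TEXT    NOT NULL,     -- history lives on this entity
    PRIMARY KEY (subscriber_id, type_code, valid_from)
);
""")

# The requested change: the Primary access method migrates onto Subscriber,
# so some access methods would sit on Subscriber and the rest elsewhere.
con.execute("ALTER TABLE Subscriber ADD COLUMN primary_access_number TEXT")
```

With the change applied, any load or query needing "all access methods for a subscriber" must look in two places, and keeping history of the primary number would force history onto Subscriber -- the first two objections listed above.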
Definitions of concepts should start with their supertype. Why not?
Andries van Renssen posts on LinkedIn:
A good definition of a concept should build on the definition of more
generalized concepts. This can be done by two parts of a definition:
1. Specifying that the concept is a subtype of some other more
generalized concept.
2. Specifying in what respect the concept deviates from other subtypes
of the same supertype concept.
Such definitions have great advantages, such as:
‑ The definition of (and the knowledge about) the supertype concept
does not need to be repeated, as it is also applicable ("by
inheritance") for the defined (subtype) concept.
‑ The definition implies a taxonomy structure that helps to find
related concepts that are also subtypes of the same supertype (the 'sister‑concepts').
Comparison with the 'sister‑concepts' greatly helps to refine the definitions.
Furthermore, the definitions prepare for creating an explicit taxonomy.
‑ Finding the criteria that distinguish the concept from its 'sister
concepts' prepares for explicit modeling of the definitions, which prepares for
computerized interpretation and the growth towards a common Formal Language.
--Thus, why not?
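As a small illustration of the two-part pattern (my own sketch, not Andries's), the same idea appears in code as inheritance: the subtype names its supertype and states only the respect in which it deviates; everything else is inherited and need not be repeated.

```python
# Definition by supertype plus distinguishing criteria, mirrored as inheritance.

class Vehicle:
    """A device used for transporting people or goods."""
    def __init__(self, max_load_kg: float):
        self.max_load_kg = max_load_kg   # knowledge stated once, at the supertype


class Bicycle(Vehicle):
    """A Vehicle that is human-powered and has two wheels."""  # only the deviation
    wheels = 2
    powered_by = "human"


class Truck(Vehicle):
    """A Vehicle that is motor-powered and built to carry freight."""  # a 'sister-concept'
    powered_by = "motor"


b = Bicycle(max_load_kg=120)
print(isinstance(b, Vehicle), b.max_load_kg)   # the supertype's definition is inherited
```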
Labels: attribute, classification, concepts, extensional set, generalization, identification, identity, intensional set, language, ontology, population, role, specialization, subtype, supertype, taxonomy, vocabulary
Can we argue more about business rules?
Posted by Kamran
Ayub - Thursday, May 6, 2010, 06:24 PM
I think it's interesting; it's too bad we waited
so long to argue about it. Let me say
first off: I understand both arguments. I have my own opinions, but I can
definitely see both sides to this. First,
I think we should establish some FACTS that we can all agree on:
1.
No DBMS currently available (at least for production-oriented apps) can completely and innately express every business rule we need to express.
(Src: experience and Everest)
2.
No matter which way you slice it, business rules eventually
get turned into code somewhere, whether it's in the DBMS or in
application logic. At some point, there's a piece of machine language where the
rule is expressed and validated in 1s and 0s.
There,
now that we agree on those two points, what exactly are we arguing about? If you believe Fact 1, then it follows
that it is not possible to ever express all our business rules in the DBMS
(at this current state in time).
Here's my view:
All DBMS do enforce a limited set of business rules. Thus, if you
argue that your application code holds all business rules in one central
place, you are wrong because the DBMS enforces a few of those
constraints (mandatoriness/uniqueness/etc.). Most of us end up duplicating
these rules to avoid catching exceptions, without thinking about it. Secondly, you cannot guarantee that
there is only a single point of entry to your data. Don't argue with me; there is a sysadmin account
somewhere on the system where the data is stored, and at least that account has full authority to change the data. How many of us have gone directly into our SQL databases to fix a data error, input sample data, or test triggers? Exactly.
Besides, let's not forget you had to define that structure somehow! It may be
true your business users can't access that data directly, but somebody
is the data steward and she has full control, rules or no rules. By the way,
what about unauthorized access? Whoops, guess that hacker didn't
care about all your application-enforced business rules! It's fine if you argue that the
hacker could completely destroy your system, but you must agree that if your logic is stored in the application, then at some point, due to the nature of TCP/IP, your data is out in the open between your app and your database. That
could potentially be tampered with, and your application can't do anything
about it.
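As a small illustration of the first point (my own example; the table and rules are invented), even a minimal DBMS such as SQLite already enforces a handful of business rules declaratively, and application code typically duplicates the very same checks before it ever touches the database:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Rules enforced *by the DBMS itself*: mandatoriness (NOT NULL), uniqueness
# (UNIQUE), and a simple value constraint (CHECK).
con.execute("""
CREATE TABLE Customer (
    customer_id  INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    email        TEXT NOT NULL UNIQUE,
    credit_limit NUMERIC NOT NULL CHECK (credit_limit >= 0)
)
""")

def add_customer(name: str, email: str, credit_limit: float) -> None:
    # The same rules duplicated in application logic, usually to avoid
    # catching exceptions from the DBMS after the fact.
    if not name:
        raise ValueError("name is mandatory")
    if credit_limit < 0:
        raise ValueError("credit limit cannot be negative")
    con.execute("INSERT INTO Customer (name, email, credit_limit) VALUES (?, ?, ?)",
                (name, email, credit_limit))

add_customer("Acme Ltd", "orders@acme.example", 5000)

# Anyone connecting directly (a sysadmin, a data steward, a hacker) still hits
# the DBMS-enforced rules -- but never the ones that live only in add_customer().
try:
    con.execute("INSERT INTO Customer (name, email, credit_limit) "
                "VALUES (NULL, 'x@example.com', 10)")
except sqlite3.IntegrityError as e:
    print("rejected by the DBMS:", e)
```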
So now, the argument is not whether it's better to store business rules in application logic; the question is why we can't put those rules in the DBMS itself, innately. The answer to that is simply that no DBMS allows that just yet. The reason I think people argue for storing
rules in the application is simply because there's really no better
alternative right now. If a .NET developer is comfortable with .NET and not
with T-SQL, where do you think she'll put her business rules? In the .NET
application or some other application gateway (not the DBMS). This is what I
heard today in our discussion. We put all our rules in application logic
because it's the best central repository we have... when really the best
place should be the DBMS. After all, what have we been learning all
along in this class? ORM is able to capture many of these rules... and yet, not
all. Business rules are there to establish the integrity of our data, so they should be in the DBMS.
I believe it's useless to argue for or against storing rules in
application logic because right now it's impossible to put them all in the
DBMS. Even if you wrote stored procs or triggers for everything, that's not
enforced by the DBMS innately (triggers are closer to automatic), plus the rules
are still in code. Furthermore, I
believe most people put their rules in application code because it's just
harder to do it in the DBMS, and because many enterprise or even regular
applications can target different storage systems (one application could use
XML, MSSQL, MySQL, DB2, or MS Access as repositories). Imagine duplicating all
your business rules across those systems... it's much easier to store the rules
in the code, isn't it? Let me be clear: I
am not advocating that it is right to store business rules in app logic.
I also believe that it is best to put business rules in the DBMS. All I am
saying is, what choice do we have at the moment? Until a better DBMS is created,
preferably one using ORM as the tool to create the DB, we're SOL. Let the discussion begin!
EVEREST RESPONDS:
.. I love it. I agree with all you
have said. I just would not characterize it as "two sides of an
argument." Both "sides" must spring from an environment of
understanding of the basic concepts. There are two main traps we fall
into:
1. Believing that someday we can reach the ideal
state of a perfect world. Your discourse is filled with those thoughts,
e.g., "no DBMS currently available..." (implying that one may be in
the future), "at this current state in time" (suggesting that there
may come a time), "no DBMS allows that just yet" (but perhaps a
future DBMS product might?), "right now it is impossible" (with the
implication that in the future it will be), "until a better DBMS is
created" (better yes, but never perfect or complete).
"Man" (which includes woman) is imperfect, and therefore, so are his
constructions, our DBMSs, our databases, our code, our systems. So try as
we might to improve (and we should), we must always be alert to imperfection,
weaknesses, failures, and attacks (whether inadvertent or deliberate)
and build in countermeasures. That is why we back up our data, build in
redundancy, and why every government (every locus of power) must have "her
majesty's loyal opposition" and the watchful eye of citizens and the
media.
2. We tend to use, and to favor, that which is more familiar to us. If we are a programmer and that is what we know how to do,
we will get things done by writing code in our favorite programming
language. If we know how to use the full functionality of some DBMS, then
that is what we will do as much as we can. Some application coders need
to be stretched to consider an alternative way of doing things that might get
us closer to the ideal of serving people (management, organizations) AND
incorporating appropriate safeguards (which also will never be perfect).
.. The business rules and integrity constraints
are those mechanisms that can lead to higher quality data, greater user confidence
(in our databases, our systems, our IT departments, and our organizations), better
decisions, and more effective and efficient operation. As you rightly state,
ALL that happens in a computer system is done by executing code. But that
is only half the story. ALL that executing code is driven by
specifications, definitions, declarations, input parameters, in other words
"data" more broadly interpreted. Even the program code is the
specification of a process. The accuracy and completeness of the process
depend on the specifications, the directives, etc.
.. As I show in my lecture on Business Rules (slides
BRules.23 and .27), there is a hierarchy of levels where this information, the
business rules, can be expressed:
1. DBMS
2. Stored Procedures, whose execution is under
the control of the DBMS.
3. Code (methods) embedded (encapsulated) in
"objects" which execute in a well-managed and controlled environment
(OO, SOA, SaaS...)
4. User application code
5. External user actions (manual or sysadmin)
.. At all these levels the processes are driven
by definitions, specifications. The efficacy of those processes depends
on those specifications, and we (humans) understand those processes through
documentation which itself is usually quite imperfect.
.. Moving down these levels we encounter
decreasing levels of control and increased dispersion. I put DBMS at the
top, because that is the closest to the data and the most likely centralized
point of control (least dispersion). Our objective is (or should be) to
move these boundaries down so that work gets pushed up into more controlled
environments. Another thing that happens as we move through these levels
is increasing abstraction. When we declare database integrity
constraints, that is a specification (abstraction) of some underlying process
(executed by code... somewhere). We (humans) do not need to know what goes
on under the covers; we merely need to have confidence that it will be done
correctly and consistently, as reflected in the definition or
specification. For example, we may understand the meaning of the square
root operator, but how it gets executed on a computer is quite complex. I want that complexity hidden from me. When we write a program we call
all sorts of subroutines or functions, which are defined at a pretty high level
(of abstraction).
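To ground levels 1, 2, and 4 of that hierarchy in something runnable (a sketch of my own; the table and rule are invented, and a trigger stands in for a stored procedure since SQLite has none), the same rule -- an order quantity must be positive -- can sit in the DBMS declaration, in procedural code run under the DBMS's control, or in user application code:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Level 1 -- DBMS: the rule is a declarative specification; how it is enforced
# "under the covers" is hidden from us, like the square root operator.
con.execute("""
CREATE TABLE OrderLine (
    order_no INTEGER NOT NULL,
    line_no  INTEGER NOT NULL,
    quantity INTEGER NOT NULL CHECK (quantity > 0),
    PRIMARY KEY (order_no, line_no)
)
""")

# Level 2 -- procedural code whose execution is under the control of the DBMS
# (a trigger here, standing in for a stored procedure).
con.executescript("""
CREATE TRIGGER orderline_qty_check
BEFORE UPDATE OF quantity ON OrderLine
WHEN NEW.quantity <= 0
BEGIN
    SELECT RAISE(ABORT, 'quantity must be positive');
END;
""")

# Level 4 -- user application code: the same rule again, dispersed and dependent
# on every program remembering to perform the check.
def insert_order_line(order_no: int, line_no: int, quantity: int) -> None:
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    con.execute("INSERT INTO OrderLine VALUES (?, ?, ?)", (order_no, line_no, quantity))

insert_order_line(1, 1, 3)
```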
.. To the comment of "too bad we waited so
long": The focus of my course is data modeling. The subject of a
companion course would be DBMS and Database Administration. This issue we
are discussing would properly belong in the second course. However, I
raise the question here because the success of maintaining the integrity of our
databases according to our defined semantics depends upon the execution
environment, the DBMS ++.
.. Let the discussion continue.