Intro

I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students, some are my responses to posts on the Linkedin forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to: geverest@umn.edu since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to do that.

2017-12-22

Natural or surrogate keys – selection criteria


In a post on the Data Modeling forum in LinkedIn (2012/05), Bryan had 2 criteria for picking a key, Tony added a third and Kim had 4. Let me give you my nine which I put forth in my advanced database design class at the University of Minnesota (see http://geverest.umn.edu):

CRITERIA: (conflicts necessitate compromise)
* Unique ‑ Guaranteed!
* Ubiquitous ‑ have a value for every entity instance, ever. In the US the Social Security number was established in 1935 but we are now running out.  As another note, the SSA had no mandate to ensure that it was unique until many decades later ‑‑ not bad with a 75 year run!
* Unlimited ‑ won't ever run out of values; reuse?
* Unchanging, never changes ‑ "immutable"
* Under your own control ‑ manage the values/codes, or at least some independent body exists to assign and maintain codes to ensure uniqueness. E.g., the Universal Postal Union (http://www.upu.int) maintains the official names of all countries in the world. A new country must apply for and have approved its name, or a change from an old name, or else it will be unable to exchange mail with other countries.
* Used by people in the user environment; if not, can you come up with one that satisfies the next criterion?
* 'Mnemonic' (easy for people to recognize/remember/generate)
* "Dataless" ‑ sole purpose is to identify ‑ should carry no other information; don't overwork an identifier
* Compact, easy and efficient to process and store ‑ often means 'numeric' and 'fixed length' (secondary)
> else, invent a surrogate (“Autonumber”) ‑ only if none exists naturally
Anyone have a 'U' word for any of the last three???

These criteria will usually get you close. If you can't get close, then take the 'else.'  It is important to distinguish between the need for an identifier for users, as expressed in a business-oriented data model, and the model to be implemented in some DBMS. These two purposes are somewhat at odds with each other.  If you can't come up with a suitable natural key (according to the above criteria, and some compromise is usually necessary) then use a surrogate key. In that case, some sort of lookup will be needed for the users to find the surrogate key based upon the best information they have when doing a search. For almost any sort of natural key, you can nearly always find examples which make things difficult: e.g., we run out of numbers; a state changes the name of a county (so any number ID assigned based upon alphabetical order becomes invalid); we change characteristics on which the ID was assigned (e.g., should color be in the unique ID for an appliance? It usually is, but I can paint my refrigerator ‑‑ color is an attribute, and should not be part of the ID).
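
To make the lookup point concrete, here is a minimal Python sketch (the entity and attribute names are my own, purely hypothetical): the surrogate key is just an autonumber, and users find a row by searching on whatever natural attributes they happen to know.

    # Hypothetical sketch: assigning autonumber surrogate keys and looking them up
    # by whatever natural attributes the user knows.

    class CustomerRegistry:
        def __init__(self):
            self._next_id = 1
            self._rows = {}                 # surrogate id -> dict of natural attributes

        def add(self, **attributes):
            surrogate_id = self._next_id    # "Autonumber": unique, unlimited, unchanging, dataless
            self._next_id += 1
            self._rows[surrogate_id] = attributes
            return surrogate_id

        def lookup(self, **known):
            # Users search on the best information they have (name, city, ...).
            return [sid for sid, attrs in self._rows.items()
                    if all(attrs.get(k) == v for k, v in known.items())]

    reg = CustomerRegistry()
    cid = reg.add(name="Alice Jones", city="St. Paul")
    print(reg.lookup(name="Alice Jones"))   # -> [1]; a search may return several candidates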

One more comment. When we speak of having an identifier, it will only be unique within a defined and finite name space, that is, a particular context. If we go outside that name space we need to qualify the ID with an additional identifier of the expanded context, ad infinitum (for example, a local telephone number must be qualified by an area code, and that in turn by a country code).

TABLE THINK - multiplicity (manyness) in a relationship?

Suppose we have a relational table - [ X | A | B | ... ] where A and B are attributes of X.
Two common confusions:


(1)  First, we can say that there IS a relationship between X and A. Since we observe that for each X there is only one A (1NF), but for each A there could be many Xs, we may be led to say there is a one‑to‑many relationship between X and A.  However, that relationship is actually many X to one A, thus many to one, not one to many, as stated.  When you say one to many (between X and A), I must apply the terms respectively; otherwise, confusion results from the ambiguity and inconsistency.  So the relationship is many to one between X and A, respectively.


(2) We observe that for each value of A there could be multiple different values of B in the table of X, and vice versa.  Does that mean there is a many‑to‑many relationship between A and B?  Any discussion of oneness or manyness must presume there is a relationship.  When you say there is a M:N relationship between A and B, then I must be able to list all the valid pairwise associations, as in a table with two columns.  And being M:N, the same value of A can appear in multiple rows and the same value of B can appear in multiple rows.  Of course, the exact same values of A and B on a single row could never appear more than once, since that would be the same instance of the relationship.

Now for any given X, there is a value for A and a value for B.  Would that particular value of A be the one that is related to that value of B?  NO.  Furthermore, could there be the same pair of A and B values appearing on multiple rows in X?  YES.  So the juxtaposition of A and B in X does not represent a relationship between A and B.  It is spurious.  IF there is a relationship between A and B, it must be represented in another table, and it could be 1:1, 1:M, M:1 or M:N.  But since we said that we know everything about X, A, and B and it is represented in the diagram, then there must not be a relationship between A and B.  Again, you can look at the table of X and see that there are multiple different values of A appearing beside the same value of B, and vice versa.  However, that in no way means there is a relationship between A and B represented in the table of X.
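
A small Python sketch (with made-up values) may make the point concrete: reading (A, B) pairs off the rows of X tells you nothing about a relationship between A and B; the table only records, for each X, one A and one B.

    # Illustrative sketch (made-up values): a table of X with attributes A and B.
    from collections import defaultdict

    rows = [
        # ( X ,   A ,   B )
        ("x1", "a1", "b1"),
        ("x2", "a1", "b2"),   # the same A appears beside a different B
        ("x3", "a2", "b1"),   # the same B appears beside a different A
        ("x4", "a1", "b1"),   # even the same (A, B) pair can appear on more than one row
    ]

    # Many X to one A: each X has exactly one A, but one A value can describe many Xs.
    xs_per_a = defaultdict(list)
    for x, a, b in rows:
        xs_per_a[a].append(x)
    print(dict(xs_per_a))                  # {'a1': ['x1', 'x2', 'x4'], 'a2': ['x3']}

    # The (A, B) pairs read off the rows are not a relationship between A and B;
    # they are just values that happen to sit on the same row because they describe some X.
    print({(a, b) for _, a, b in rows})    # a spurious pairing; duplicates simply collapse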

Physical Data Model vs. Implemented Database Schema ‑ To what degree should they match?

 By: Brent  (Linkedin Discussion: Started 2011 Nov. 9)
. . We're having an internal discussion about PDM standards. The existing standard says that, for all tables, columns, keys, indexes, and relationships, the PDM should match what was implemented in the physical database. We have a huge Teradata database that has no primary key constraints, foreign key constraints, or indexes implemented in the database, but they are still in the data model. Those components are part of the application code used to load the tables, so the total database implementation = database schema + data load rules, and the standard is satisfied. Agree? Disagree? What's your standard?

EVEREST RESPONDS:


First recognize that a data model goes through several stages from a detailed “conceptual” representation of the users’ world to final implementation in some DBMS.  This question suggests there are at least two stages.  Ideally, as the data model is transformed from one stage to the next, the (user) semantics are retained (perhaps with differing syntactical representations), but additional features and constructs may be added.  In general, the model at each stage will be different than the previous stage.  So I would answer that they need not and should not match, except in the semantics of the users’ world. 
Note that user semantics does not include things such as clustering data items into records/tables, converting the representation of relationships into foreign keys, adding indexes, surrogate keys, enforcing first normal form (other normal forms are still relevant), etc.  These are all introduced for the convenience of the system to be implementable in some data modeling scheme (such as relational) and some target DBMS, and to reflect storage efficiency and performance goals.  Note that none of these have anything to do with the users' view in a data model, and should not be reflected in that first-stage model (some call this the "conceptual" data model).  Here the notion of conceptual should not be confused with a high-level or abstract presentation of a data model.  While we may present a data model at a high level of abstraction, the underlying model needs to be as detailed as necessary to capture as much as possible of the exogenous semantics of the users' world.  That model should have no consideration or constraints relating to physical implementation.
Back to the question.  The implemented database schema (in some target DBMS) may have some additional elements defined in a way that is peculiar to that system, but they need not be expressed in the physical data model.  While the physical data model is a prelude to implementation, it is primarily a vehicle for communication with people, to understand the data model.  The physical data model can include generic elements of a physical representation, but not necessarily in the way they might be represented in the implemented schema.
Of course, it is important to document (for people) the transformations (mappings) made from one stage of a data model to the next, and the rationale for those choices.  This documentation is key.  From the discussion so far, people seem to be assuming that all you have are the two data models, and the issue is to what extent are they different or should they be the same.  If you want them to be the same, then I would argue for two physical data models - one generic, and one tailored for the target DBMS implementation.  However, this is more work and probably not very useful.  So keep them different, and document the differences.  The tool (DBMS or data modeling tool) should have some ability to produce a human readable representation of the stored schema, even in a graphical form.  We need both logical/physical models and “logical maps.”
We should note here that data modeling schemes and DBMSs have their limitations (often chosen for efficiency reasons).  Hence, we need to transform or fudge or introduce (sometimes spurious) constructs to allow certain semantics to be represented in the next stage model.  For example, to transform an ER model to a relational model we must impose the "single atomic value" constraint on attributes (1NF), which means we must resolve all M:N relationships into two 1:M relationships with an intersection entity.  Also, we must represent all relationships with foreign keys, which is another reason why we cannot represent M:N relationships in a relational model.  None of this is necessary in the users' conceptual model.  People don't have a problem understanding M:N relationships!  Only a relational DBMS does!  One approach to implementing complex user constraints which are not expressible in the target DBMS is to embody them in triggers and stored procedures, or worse yet, in application code.  One example might be enforcing referential integrity, if that feature is not available in your target DBMS.   Now the semantics really get buried and difficult for users to comprehend.
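As a rough illustration of that M:N transformation (the entity names are hypothetical, not taken from the discussion), here is a Python sketch of the relational-side result: the intersection table and its foreign keys exist only for the benefit of the relational scheme, not for the users' semantics.

    # Sketch: the users' fact type "Student is enrolled in Course" (many-to-many)
    # becomes two 1:M relationships through an intersection (associative) table.
    from dataclasses import dataclass

    @dataclass
    class Student:
        student_id: int          # primary key
        name: str

    @dataclass
    class Course:
        course_id: str           # primary key
        title: str

    @dataclass
    class Enrollment:            # intersection entity: one row per (student, course) pair
        student_id: int          # foreign key -> Student   (many Enrollments to one Student)
        course_id: str           # foreign key -> Course    (many Enrollments to one Course)

    # Hypothetical data: student 1 takes two courses; course "MIS8720" has two students.
    enrollments = [Enrollment(1, "MIS8720"), Enrollment(1, "MIS5730"), Enrollment(2, "MIS8720")]
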
To Alan Howlett’s point, if we try to (manually) maintain a PDM and an implemented schema which are intended to be the same, that is simply an invitation to introduce inconsistency.  If they are to be the same then they must be stored as a single data model.  Hopefully you have a tool which can generate user-friendly representations of the model, a graphical representation, or a schema for your DBMS (multiple?).

Logical vs. Physical data model

Jaunine Conradie (former student) asks:
. . Following the differences between logical and physical data models (e.g., a logical model includes many‑to‑many relationships, thus excluding association entities except when attributes are associated with the association entity)...
Here's the question: If the task is to draw a relational model, do I draw a logical or a physical data model?
If the answer is a logical data model, then I expect to not see association entities and thus also not the foreign keys that would have been included in the association entity.
If the answer is a physical data model, then I expect to see every attribute, as it would be implemented in a specific RDBMS (thus, including data types), based on a full analysis of all available attributes and their values.

Everest response:
    First, I think we need to drop the term "logical" model.  All data models are logical, and some have a physical representation by being implemented in some DBMS. The distinction between logical and physical models arose several decades ago in an effort to rise above the data modeling schemes of the day, which were just trying to draw pictures of a database as implemented and stored physically.  Today we have all sorts of "logical" data modeling schemes.
. . What is important is that every data model is done with and conforms to some modeling scheme, whether logical or physical.  So we can have logical models which are relational, network, hierarchical, single flat file, extended network (allowing M:N relationships), ER, Extended ER (which included ternary relationships and sub/supertypes), multidimensional,... and Fact Oriented Models like ORM.
. . Terry Halpin refers to Relational models as "logical" and ORM models as "conceptual."  OK, but don't get hung up on the terminology; it is the semantics that are important, not the syntax, i.e., the choice of words to describe semantic concepts.


. . If the answer is a logical model, then it will be drawn according to some modeling scheme, which may be relational, in which case it includes foreign keys and intersection entities for M:N relationships.  This logical model could include attribute data types, although that information is getting closer to the physical.  Even though we call ORM conceptual, we do specify data types for the objects!
. . If the answer is a physical model we will generally include more information relating specifically to some DBMS, for example, conforming to the naming conventions and the available data types in the DBMS.  I also suggest dropping the term "physical" model, preferring instead "implementation" model.

. . The truth of the matter is that any data model can be represented over a range of stages, from conceptual to physical, and at each stage it conforms to a particular modeling scheme.

On the origin of the fork notation

A student posts: 
    Dr. Everest, I just caught the fact that you developed the "fork" notation.
Good thinking, it is the most clear and simple of all 1:M relationship notations by far!
‑It occurred to me that maybe the profile of a badminton birdie was your inspiration?

Everest response:

    At the time when I did that (in a 1976 paper*, which became chapter 4 in my Database Management book, McGraw-Hill, 1986, §4.3.2, p.132) my idea came from trying to resolve the difference between the then dominant diagraming scheme of an arrow from parent to child (due to Charlie Bachman in a 1969 paper), and Nijssen drawing the arrow in the opposite direction to represent a function.  What I did was move Nijssen's arrow head to the other end of the arc and voila, a fork!
    
*Here is the reference (click on the title to get a copy of the paper):
Gordon C. Everest, "Basic Data Structure Models Explained With A Common Example," Computing Systems 1976, Proceedings Fifth Texas Conference on Computing Systems, Austin, TX, 1976 October 18-19, pages 39-46. (Long Beach, CA: IEEE Computer Society Publications Office).
            Original paper using a notation for relationship characteristics.  Thought to be the first to use what is now called a "fork" to represent multiplicity in a relationship.  An updated version appeared as Chapter 4 in Database Management:  Objectives, System Functions, and Administration, McGraw-Hill, 1986.

2017-12-20

Why number the rules of normalization? Is it helpful?

Chris Date/Ted Codd originally numbered the rules of normalization, so we have 1NF, 2NF, etc. They were defined in a cascading fashion. For example, for a data structure to be in 3NF, it must satisfy the 2NF rule, which in turn requires that it satisfy the 1NF rule. This definition has persisted in virtually all introductory text books on data management and data modeling, evidently with the authors not applying any critical thinking to the question. The authors just blindly accept what has been written before.


So is it possible for a relation (entity record or table) to have no transitive dependencies (thus satisfying the condition for 3NF) but have a partial dependency (violating 2NF)? Would that situation mean that the record/relation did not satisfy 3NF? Is it helpful to say that? Even more, when we speak of nested relations or object modeling (as in UML, for example), they explicitly violate 1NF by allowing nested repeating groups of data items or attributes in a record/relation. Does that mean that such structures automatically do not satisfy 2NF or 3NF, even if they have no partial dependencies or transitive dependencies?

The rules of normalization should be expressed or named by the condition which causes a violation.
1NF = NO multivalued attributes (repeating groups), which would otherwise require a many-to-many relationship.
2NF = NO partial dependency, i.e., no data item ("attribute") dependent on only part of a composite key.
3NF = NO transitive dependency, i.e., no data item dependent on the key only through an intermediate data item.
...
Then the rules of normalization are independent and can be applied in any order.
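
As a rough sketch of what independence means here (the data and the helper function are my own, for illustration only), each rule reduces to a test for its own offending kind of dependency, and the tests can be run in any order.

    # Helper: does the attribute set lhs functionally determine the attribute rhs in these rows?
    def fd_holds(rows, lhs, rhs):
        seen = {}
        for r in rows:
            key = tuple(r[a] for a in lhs)
            if key in seen and seen[key] != r[rhs]:
                return False
            seen[key] = r[rhs]
        return True

    # Hypothetical table: OrderLine(order_no, product, customer, city), key = (order_no, product)
    rows = [
        {"order_no": 1, "product": "P1", "customer": "C1", "city": "Minneapolis"},
        {"order_no": 1, "product": "P2", "customer": "C1", "city": "Minneapolis"},
        {"order_no": 2, "product": "P1", "customer": "C2", "city": "Duluth"},
    ]

    # 2NF test: a non-key attribute dependent on only part of the composite key?
    partial = fd_holds(rows, ["order_no"], "customer")                 # True -> partial dependency

    # 3NF test: a non-key attribute dependent on the key only via another non-key attribute?
    transitive = (fd_holds(rows, ["order_no", "product"], "customer")  # key -> customer
                  and fd_holds(rows, ["customer"], "city"))            # customer -> city

    print(partial, transitive)   # each violation is detected on its own, in either order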

Can we avoid the need for normalization entirely?

Normalization is the Achilles heel of data modeling. Even experienced data modelers get it wrong sometimes. Wouldn't it be nice if we had a data modeling scheme which did not require the modeler to think about and apply the rules of normalization? How might that be possible?

Any form of record-based modeling (e.g., Relational, ER, IDEF1X, UML...) requires the enforcement of normalization rules.  In record-based modeling, data items are clustered together in records (or tables) to describe a particular entity type, the data items are then called "attributes" of the entity.  The problem is that sometimes we put attributes into a table when we shouldn't.  The rules of normalization help us to identify which attributes violate the rules, and hence must be removed from that table (and probably put into some other, often new, table).

Unfortunately, this is necessary but not sufficient for a good design.  The rules of normalization are only applied to a single record/table at a time.  There may be inter-record relationships which require further examination to ensure a good design.  Another problem is:  you must first determine or designate the identifier before you can apply the rules of normalization.  That is because normalization is all about what "determines" each attribute.  To paraphrase Chris Date (which has been widely recounted) "every data item (in a record) must be single valued (for each value of the identifier key) [1NF], must be functionally dependent on the key, the whole key (if  composite) [2NF], and nothing but the key (i.e., directly dependent with no transitive dependencies) [3NF] ... so help me Codd!"  

Normalization is the test, and record decomposition is the remedy for violations (i.e., removing data items from the record).  The need for normalization can be avoided if we don't cluster data items into a record in the first place.  That is the essence of fact (oriented) modeling, such as ORM.  The inverse of functional dependency is determinacy.  In a normalized record, the determinant of an attribute is the key or identifier.  In fact modeling, we directly relate each data item with its determinant.  That then gives us an unambiguous way to store data items into records or tables without violating any of the rules of normalization.  In a sense, a fact model is the ultimate decomposition of a (record-based) database, such that each record now has at most one non-key domain.  Some have called this the 6th NF, but beware, there are other definitions of 6NF.

So the answer to your question is yes... if we design our data models using fact modeling we can avoid the need for normalization entirely!
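
Here is a loose Python sketch of that ultimate decomposition (all names hypothetical): each elementary fact is stored directly with its determinant, so any later grouping of the facts into records cannot introduce a partial or transitive dependency.

    # Record-based view: Employee(emp_id, name, dept, dept_location) -- needs normalization checks.
    # Fact-based view: one two-column "fact table" per elementary fact type.
    employee_name = {"E1": "Alice", "E2": "Bob"}     # Employee -> Name
    employee_dept = {"E1": "D10", "E2": "D10"}       # Employee -> Dept  (works in at most one)
    dept_location = {"D10": "Minneapolis"}           # Dept -> Location  (stays with Dept, not Employee)

    # Grouping facts that share the same determinant reconstructs the normalized records:
    employee_record = {e: (employee_name[e], employee_dept[e]) for e in employee_name}
    dept_record = dept_location
    print(employee_record)   # {'E1': ('Alice', 'D10'), 'E2': ('Bob', 'D10')}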

Five rules of normalization in one

Do you think that it is possible to represent the five rules of normalization in a single rule? 
Well, here it is:
"Store each data item with its determinant as the identifier in a record/table."
By data item I mean any data element consisting of a domain of values (which are surrogates of things in the real world), which are members of a population called by the name of the data item/element.
A determinant is based on a relationship between two populations.  If the members of one, say Y, are functionally dependent on the other, say X, then we say that X determines Y.  The functional dependency can be expressed as Y = Fn(X).   For example, if we had department as one population and employees as another, and we assume that an employee can work in at most one department (at a time), then there is a functional relationship between them:  Dept = Fn(Employee), that is, a given member of the employee population (ID) determines a unique value for Dept, hence a function.  If there cannot be a Y without an X, then we have a functional dependency.

In a side note, the relationship could involve a single population, in which case we call it a reflexive relationship.  An example would be "is boss of" on an employee population.
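
A tiny code sketch of the definition, using the Dept = Fn(Employee) example above (the pairs are made up): the dependency holds when no member of X maps to more than one member of Y.

    # Each pair records the fact "Employee works in Dept".
    works_in = [("E1", "D10"), ("E2", "D10"), ("E3", "D20")]

    def is_function(pairs):
        mapping = {}
        for x, y in pairs:
            if x in mapping and mapping[x] != y:
                return False                 # some X with two different Ys: not a function
            mapping[x] = y
        return True

    print(is_function(works_in))                         # True:  Dept = Fn(Employee)
    print(is_function([(d, e) for e, d in works_in]))    # False: Employee is not Fn(Dept)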

Why isn't ORM more popular?

In 2010 December, on LinkedIn, Adam Ralph asked:

About 4 years ago, I was tasked with designing a new data centric application. My first instinct was to draw a conceptual model. For previous projects, I'd used entity relationship modelling but I was keen to find out if there were any alternative approaches. Scouring the internet, I stumbled across ORM. I printed out Terry Halpin's article "Object Role Modeling: An Overview" and studied it over the next couple of days. The approach seemed rather compelling, especially given the natural language aspect. So, I worked my way through CSDP for part of the UoD for the application. The model which presented itself, with the automatic verbalization and relational model translation was so elegant that I have never looked back.

ORM has now become my starting and reference point for most of the work I do. Yet, I'm constantly surprised that no‑one else I encounter has even heard of it. After four years, including a change of company (from one large blue chip to another), I still have not encountered a single person who has heard of it, let alone practiced it.

I believe ORM is by far the most expressive and natural data modeling technique currently available, so the question arises, why isn't it more popular?

EVEREST RESPONSE:

I am often asked this question.  It is addressed in my Advanced Database Design course which teaches ORM. Let me give you some reasons from my perspective.  Some reasons stem from the early years, others relate to the present state of data modeling practice.

In the early years in the Evolution of NIAM/ORM:

(1) CDC kept it proprietary.
(2) Few academics were involved (who need to publish to keep their jobs and get promoted!)
(3) Lack of strong vendor support
(4) Lack of good papers, text book, and inexpensive, viable tool to support NIAM.

ORM has its roots in the work of Sjir Nijssen, with later notable contributions by Terry Halpin.  In 1975 August in Germany I had the privilege of teaching a course with Sjir.  At that time we reviewed a couple of draft papers he had prepared on the work he was doing on the Kadaster project in the Netherlands while working for Control Data Corporation (CDC). The papers were published in 1976 and he called it “binary modeling.“  CDC closely guarded the method as valuable intellectual property and began offering Information Analysis services to clients (for a fee).  This became known as the Nijssen Information Analysis Method (NIAM).  Most of the papers were published in obscure journals so did not get out to a wide audience (unlike the seminal paper Ted Codd published in 1970 on the relational model of data).  CDC developed an automated tool to support NIAM, as did others in those earlier years but they did not become popular.  Most of the activity was in Europe with little exposure in the USA, except perhaps for my course in Advanced Database Design at the University of Minnesota which I have taught since 1971.  I came back from the 1975 encounter with Sjir convinced of the superiority of the modeling method over relational, ER, or CODASYL network modeling.  I was so excited that I incorporated it into my class. However, it was little more than a lecture or two.  There were no really good papers, no text book, and no viable tool. 

Then Sjir Nijssen went to the University of Queensland in Australia in 1983 as a visiting chaired professor.  There he hooked up with Terry Halpin who saw the superiority of NIAM.  Together they published the first comprehensive treatment of the modeling scheme which was renamed Object Role Modeling (ORM) in 1989. Then some others saw it as a better way to do data modeling and developed an ORM modeling tool (InfoDesigner from Serverware, to Asymetrix as InfoModeler, Visio, and finally Microsoft).  Finally we had a tool and a textbook.  I began teaching predominantly ORM in my Advanced Database Design class in early 1990s.  Today it is available wholly online.  Go to http://geverest.umn.edu  for more information.  The course is offered only in the spring, and 2011 begins on Jan 20.

(Terry and Sjir can validate and correct the information above).

Reasons for ORM's lack of popularity stemming from Current Data Modeling Practice - PART 1

() ORM is very difficult to grasp on your own by reading a book, without good training.  Adam, you are to be commended for taking the time and having the persistence to really dig into and understand ORM.  My course provides the necessary knowledge and skill to use ORM and its data modeling tool, NORMA.
  
() blind acceptance in the industry that ER/Relational data modeling is the best way for data modeling.  Practicing data modelers are always thinking in terms of ER diagrams as their data models, or worse, relational tables.  I call this “Table Think” and the malady is “Tableitis.”


() Failure to recognize the problems inherent in the use of any record-based data modeling scheme, that is, based upon the formation of entity records or objects with a set of attributes.  Normalization is the way to detect and resolve some of these inherent problems.  Unfortunately, it is practically and theoretically impossible for any DBMS or data modeling tool to help the designer produce a normalized structure, unless... they use ORM.   ORM does not distinguish the notion of entities/objects from that of attribute; everything is a domain or population of something (called an object, which is distinctly different from that in object-oriented modeling).  Objects have attributes by virtue of the role they play in relationships with other objects.  Hence, ORM has two constructs – objects and roles (in relationships).

() Most practicing data modelers think they are doing a good job.  They do not realize or consider that there may be a better way.  My course is a hard sell to experienced data modelers.  But give me a chance to talk to them and it at least opens up the possibility of a better way.  I have had some skeptics start my course, but they soon get converted because ORM is so compelling.  I have even had some that go to the end of the class unconvinced, and come back to me months or years later to say they finally got it!  I thank former students, Kevin and Troy, for their postings on this discussion group.

() nearly all textbooks on database management or database design present only record-based data modeling schemes.  So we continue to perpetuate the myth that this is the way to go.

Reasons for ORM's lack of popularity stemming from Current Data Modeling Practice - PART 2

() Most published research studies comparing various data modeling schemes including ORM report that it is inferior.  Upon closer examination, we discover that the experiments were conducted by people who had incomplete knowledge and skill in ORM.

() ORM is often criticized as being bottom up and too detailed for conceptual modeling.  We hear this from both proponents and opponents of ORM.  So we have subject area diagrams, or high-level ER diagrams.  That just means some of the underlying semantics are hidden in the presentation.  The attributes are still there, even if we only show entities and relationships.  Being too detailed or not is a matter of presentation, not modeling.  The goal of (conceptual) data modeling is to capture as rich a set of semantics as possible of the users' world being modeled.  But it is foolish to present it all at once (are your conference room walls papered with diagrams to show all the entities and attributes in your enterprise data model?).  Since humans have cognitive limitations, we present various abstractions of the model, eliminating parts or detail, to facilitate human understanding.

() All the major DBMSs are based on the relational data model(ing scheme) so we often model in tables.

() The annual ORM workshop held under the OnTheMove conferences does not get beyond a small group of very dedicated enthusiasts.

() Lack of vendor support.  With the acquisition of Visio in 1999, Microsoft got the InfoModeler ORM tool.  They added it to the enterprise team editions of Visual Studio as Visio (not the same as what is in the Office Suite).  Unfortunately, Microsoft has done nothing with the ORM tool.  It has essentially the same functionality as Visio got when they acquired InfoModeler 3.1.  Microsoft has not enhanced the functionality of the tool and provides no training in ORM or the use of the Visio tool.  Fortunately, there is a better alternative tool now available called NORMA -- open source and runs as a plug-in to Visual Studio.  With a valid ORM model, just push a button and NORMA generates a relational database guaranteed to be in fifth normal form, and ready for direct input as a data definition to any one of several popular DBMSs.  Nevertheless, NORMA still has much maturing to do.  It needs additional enhancement to become an industrial-strength data modeling tool, and that is in the works at LogicBlox.

That was probably more reasons than you wanted.  If you desire a taste of or motivation for ORM, you can attend my Workshop entitled “Rethinking how we do data modeling” at the DAMA Enterprise Data World conference (edw2011.wilshireconferences.com), in Chicago, 2011 April 3-7.  I will also be giving a presentation entitled “Helping Business Users Understand Your Data Models.”  More information is at my website: http://geverest.umn.edu. So the challenge is: how do we get the word out?

Value Object Type vs. Entity Object Type

A student asks:
... I'm still somewhat confused about when to use an entity type object and when to use a value type object. Let's take an example. Suppose we have two related columns of data:

Day Code     Day Name
Su           Sunday
M            Monday
T            Tuesday
W            Wednesday
Th           Thursday
F            Friday
Sa           Saturday
    It seems that there might be two ways to represent such a 1:1 relationship. We might have a fact statement with two entity type objects, written as:            
DayCode(.code) has DayName(.name)
or an entity type object and a value type object, written as:       
DayCode(.code) has DayName()
In either case, the predicate constraints could be unique/mandatory on both objects.  While both fact statements seem to have the same results, what are the advantages and disadvantages of using one or the other in such a situation?

EVEREST RESPONSE: 

Good question. I refer you to Halpin [2008], pages 10, 65, and 95. He doesn't say much about this but perhaps enough to understand the idea.

A value object type is a constant. It is self-identifying, that is, when you see the value written down, you (the user community) always know what is being referred to. So in your example, we always know what day of the week is referred to by either your DayName or DayCode. Each could be a value object type. In each of these two examples, what you really have is merely a string constructed from some alphabet. There is another population here and that is the “days of the week”. That would be a non-lexical object type (NOLOT). Then we can use a name or code to refer to the day of the week. We will pick one of them to be the reference mode for day of the week. And if that is always sufficient and unambiguous, we can make it a value object type.

Let me take you through the thinking to answer your question. Everything we want to include in the database can initially be thought of as a NOLOT. (See slide# ORMod.42, which is essential to understanding this distinction, which is made in slide 43.) Then we can ask, what are the different ways of referring to those things?  Each of those would be a separate object type. If we pick one of those, the day of the week can be referenced directly (with the name or code) and hence there is no need for some other scheme for identifying or referencing days of the week. Of course, the choice of identifier must be unique, ubiquitous, etc. (as discussed in class, slide# ERel.38).

NOTE that a value object type can also be a number. That includes all the number types we talked about in class. Also note that the string of numeric digits is actually referring to some point on a scale (that is the NOLOT), which can have various arithmetic properties (cardinal, interval, ordinal, nominal). We have to know what the base is in determining the number from the string of digits. For example, '10' is a very different number (the NOLOT) depending on the base, be it binary, or octal, or decimal. The representation of the number will be some chosen lexical object, constructed from some naming scheme. Again, in a particular context, the string of digits may be sufficient to always and unambiguously refer to (represent) the number value, for all the users in that community. Hence, we could make the number a value object type.

As you can see, this is not always hard and fast, depending on the user context. For example, we may be interested in how days of the week are referenced in other languages. Further, we may want to relate these to the days in some calendar other than the Gregorian calendar, which is used by most of the English-speaking world. So you see, we really have to go back to thinking of the day (no longer of the week, since other calendars may not have the concept of a week) as a NOLOT, apart from how it is referenced. The value object type is only used when values in a particular referencing scheme are sufficient and always unambiguous to refer to members of the NOLOT population. The NOLOTs are always there in the world being modeled. The naming and reference scheme are our own constructions. The conundrum is that we have no way to refer to members of the underlying NOLOT population (the semantics) without using the lexical handle (the syntax). As we said last week, we can only see or represent the semantics through the syntax.

What is the advantage? The use of value types can simplify our model. We can let the value itself (the lexical representation) be the object. It is sometimes a convenient shorthand.

Again, thank you for asking the question. It has given me an opportunity to think it through more clearly, I hope. Sorry for being so long-winded.
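
For readers who think in code, here is a loose analogy in ordinary Python (not ORM itself, and entirely my own construction): the enum member stands for the underlying day of the week (the NOLOT), while the code and the name are two lexical ways of referring to it; either could serve as a value object type if it is always sufficient and unambiguous in the user context.

    from enum import Enum

    class Day(Enum):
        # The member is the underlying thing; its value (code) and label (name)
        # are two lexical references to that same thing.
        SUNDAY    = "Su"
        MONDAY    = "M"
        TUESDAY   = "T"
        WEDNESDAY = "W"
        THURSDAY  = "Th"
        FRIDAY    = "F"
        SATURDAY  = "Sa"

        @property
        def label(self) -> str:
            return self.name.capitalize()    # "Sunday", "Monday", ...

    d = Day("M")                  # referring to the day via its code
    print(d.label, d.value)       # Monday M -- two names for the same underlying day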

Valuing Information

To value data we must look at its information content.  What we really do is value information.  Information is intangible intellectual property (and its tangible form is in the data).  Valuing information “assets" is important to make the case for investment, governance, management, etc. in our organizations and to executives.  No one doubts that information (and hence its recorded realization) is of value and necessary to the running of an enterprise.  Notwithstanding that the benefits stemming from investments are hard to quantify, valuation is a worthwhile and necessary exercise, but it does not have to appear on the balance sheet to be recognized as of value.

As we think about this question, we must clearly establish which meaning of asset we are using, the accounting concept, or a more general meaning, namely, something of value.  In accounting, something is put on the balance sheet as an asset because the expenditure benefits (i.e., helps to generate income in) future time periods. (And perhaps, to a lesser extent, to provide a valuation for the company.  However, that is not always accurate because it is based on historical costs, and not current values).  It is simply a mechanism to provide somewhat more accurate reporting of profit and loss in a given (future) time period.
  I am reminded of an article written by James Hekimian (he comes up first on a Google search) entitled "Putting People on the Balance Sheet." I think it was published in the Harvard Business Review back in the 1960s when I had him as an accounting professor at MIT.

  Your point is well taken that information needs to be managed with appropriate organizational arrangements, similar to how we manage human resources, which we don't call "human or people assets."  When a company making a takeover bid seeks to value another company, they certainly consider the quality, etc. of the work force.  That is not reflected on the balance sheet.  Similarly, they should be considering the value, quality, availability, etc. of the information resources and the IT infrastructure when making a valuation for a possible takeover.

DATA vs. INFORMATION

I find this a useful distinction.  Data is the syntactic representation of information (the semantics).  Data is the bits and bytes recorded (“fixed in some tangible medium,” as the copyright law states about what intellectual property can be copyrighted).  Information is derived from data but requires definition and context (the meta data?).  For example, I write down 55.  That is data.  What does it mean (the semantics)?  If I tell you it is age, in years, that is helpful.  If I tell you it is my age, in years, that is more helpful.  Perhaps it is really the weight of John Jones taken on a certain date, at a certain location, on a particular scale, measured in kilograms, with no clothes on, and rounded to the nearest kilogram.  Now the information has greater value.  And of course, with age the information usually becomes less valuable (for most purposes).  I loved Terry Hanold's definition of data given in an article in Datamation about 40 years ago:  “Data is the digital shadow of haphazard events indifferently recorded.”  As you add definition and context, aggregate it, etc. it gradually becomes more and more valuable for some purposes.

Data As an Asset

Nigel asks on DAMA International LinkedIn website:   

... I wonder if anyone can help. I’m looking for some information. The DAMA Body of knowledge makes it abundantly clear that Data is an “asset”. This got me thinking as I’ve always seen it as a resource. Regardless, assets tend to be reflected on the balance sheets of most companies. Is anyone aware of ANY company, anywhere, that has valued its data as an asset and then reflect that on its balance sheet / company books? I suspect it may be true of companies whose "stock in trade" is data or information, but for "normal" companies in the retail or financial services industry, I'd be curious if anyone has matured to this extent.

Everest responds:

Wow.  What a great discussion.  This is the first time I have seen it.  Let me add my two cents.  Perhaps I can lend some clarity.  Information (in its syntactic representation as data) is of value to some people and for some purposes.  Whether or not its valuation should be put on the balance sheet is essentially an accounting question.  Much of the discussion above got off the balance sheet. 

My accounting professor hammered home the point that we must always ask “what are you trying to show and why – for what purpose?”  The purpose could be management decision making, SEC filing, income tax determination, reporting to stockholders, or a possible sale or acquisition.  Accountants do things differently for different purposes.

Several companies whose product is information do capitalize the cost of developing (and maintaining) their databases of information.  That allows the accountants to spread (amortize) the costs over future periods when the revenues are realized.  Matching costs in the same period as the revenue is generated is a fundamental principle of accounting.

With other types of organizations, we might ask: if data is an asset, why is it not on the balance sheet?  However, the connection with the future is rather tenuous.  Hence we expense the costs, charging them to the period in which they were incurred.  That is partly justified because database maintenance (and enhancement) is an ongoing expense.  What is the benefit of pushing the expense to a future time period?  It is like a training budget for human resources.  The benefit is realized when employees work for the company in the future and can make a more valuable contribution with training (and experience).  We would acknowledge that people are a valuable asset to an organization, but not that we should, therefore, capitalize their salaries because they will be able to help us generate revenue in future time periods.

When we say “data is an asset” what we are really saying is that data is something we have (it exists) and it is of value.  Just because it is of value does not mean it should be reflected on the balance sheet as an asset.  The question suggests two different meanings for the word asset - the narrow one strictly from an accounting perspective (to allocate costs to some future time period in which revenues are derived), and a broader, more general one of simply “something of value.”

Income tax is another argument for not capitalizing data development and maintenance expenses.  To minimize taxes, we try to write off expenses as early as possible, that is, in the period in which they are incurred.  (Accountants generally go further and charge it as an expense when the obligation to pay is incurred, as with a credit card purchase.)   That way it offsets current revenue so that we don't have to pay income tax on it.  If the tax laws allow early write-off of expenses (e.g., Section 179 assets) but that does not accurately reflect the reality of matching costs with revenues, then we will keep two sets of books - one for the tax collector, and one for corporate managers and decision making.