Intro

I have collected Q&A topics since about 2010. These are being put onto this blog gradually, which explains why they are dated 2017 and 2018. Most are responses to questions from my students, some are my responses to posts on the Linkedin forums. You are invited to comment on any post. To create a new topic post or ask me a question, please send an email to: geverest@umn.edu since people cannot post new topics on Google Blogspot unless they are listed as an author. Let me know if you would like me to do that.

2017-12-22

Physical Data Model vs. Implemented Database Schema ‑ To what degree should they match?

 By: Brent  (Linkedin Discussion: Started 2011 Nov. 9)
. . We're having an internal discussion about PDM standards. The existing standard says that, for all tables, columns, keys, indexes, and relationships, the PDM should match what was implemented in the physical database. We have a huge Teradata database that has no primary key constraints, foreign key constraints, or indexes implemented in the database, but they are still in the data model. Those components are part of the application code used to load the tables, so the total database implementation = database schema + data load rules, and the standard is satisfied. Agree? Disagree? What's your standard?

EVEREST RESPONDS:


First recognize that a data model goes through several stages from a detailed “conceptual” representation of the users’ world to final implementation in some DBMS.  This question suggests there are at least two stages.  Ideally, as the data model is transformed from one stage to the next, the (user) semantics are retained (perhaps with differing syntactical representations), but additional features and constructs may be added.  In general, the model at each stage will be different than the previous stage.  So I would answer that they need not and should not match, except in the semantics of the users’ world. 
Note that user semantics does not include things such as clustering data items into records/tables, converting the representation of relationships into foreign keys, adding indexes, surrogate keys, enforcing first normal form (other normal forms are still relevant), etc.  These are all introduced for the convenience of the system to be implementable in some data modeling scheme (such as relational) and some target DBMS, and reflecting storage efficiency and performance goals.  Note that none of these have anything to do with the users’ view in a data model, and should not be reflected in that first stage model (some call this the “conceptual” data model).  Here the notion of conceptual should not be confused with a high-level, or abstract presentation of a data model.  While we may present a data model in a high-level, abstraction, the underlying model needs to be as detailed as necessary to capture as much as possible of the exogenous semantics of the users’ world.  That model should have no consideration or constraints relating to physical implementation.
Back to the question.  The implemented database schema (in some target DBMS) may have some additional elements defined in a way that is peculiar to that system, but need not be expressed in the physical data model.  While the physical data model is a prelude to implementation it is primarily a vehicle for communication with people, to understand the data model.  The physical data model can include generic elements of a physical representation but not in the way which might be represented in the implemented schema. 
Of course, it is important to document (for people) the transformations (mappings) made from one stage of a data model to the next, and the rationale for those choices.  This documentation is key.  From the discussion so far, people seem to be assuming that all you have are the two data models, and the issue is to what extent are they different or should they be the same.  If you want them to be the same, then I would argue for two physical data models - one generic, and one tailored for the target DBMS implementation.  However, this is more work and probably not very useful.  So keep them different, and document the differences.  The tool (DBMS or data modeling tool) should have some ability to produce a human readable representation of the stored schema, even in a graphical form.  We need both logical/physical models and “logical maps.”
We should note here that data modeling schemes and DBMSs have their limitations (often chosen for efficiency reasons).  Hence, we need to transform or fudge or introduce (sometimes spurious) constructs to allow certain semantics to be represented in the next stage model.  For example, to transform an ER model to a relational model we must impose the “single atomic value” constraint on attributes (1NF) which means we must resolve all M:N relationships into two 1:M relationships with a intersection entity.  Also, we must represent all relationships with foreign keys, which is another reason why we cannot represent M:N relationships in a relational model.  None of this is necessary in the users’ conceptual model.  People don’t have a problem understanding M:N relationships!  Only a relational DBMS does!  One approach to implementing complex user constraints which are not expressible in the target DBMS is to embody it in triggers and stored procedures, or worse yet, in application code.  One example might be enforcing referential integrity, if that feature is not available in your target DBMS.   Now the semantics really get buried and difficult for users to comprehend.
To Alan Howlett’s point, if we try to (manually) maintain a PDM and an implemented schema which are intended to be the same, that is simply an invitation to introduce inconsistency.  If they are to be the same then they must be stored as a single data model.  Hopefully you have a tool which can generate user-friendly representations of the model, a graphical representation, or a schema for your DBMS (multiple?).

No comments:

Post a Comment

Comments to any post are always welcome. I thrive on challenges and it will be more interesting for you.