Saturday, April 2, 2011

The Good, The Bad and The Ugly of Metadata Management

Introduction
The simplest definition of metadata would be 'data that describes other data'. It adds context, meaning and understanding to the data being described. Database\Data warehouse projects often have some form of metadata management associated with them. Metadata management, in simple terms, is the continuous process of maintaining metadata (adding, updating and other maintenance activities, depending on the form of metadata storage). An example of metadata management would be the data modeling phase, where entity and attribute definitions are maintained by tools such as ERwin, ER/Studio, etc.
For the scope of this blog I will confine myself to technical metadata (specifically, data dictionary metadata).
Metadata can be broadly classified into three types:
1. Business Metadata: This metadata defines and describes the actual data existing in the database or data warehouse in a business sense. Ex: The column Quantity in table Orders refers to the total number of orders placed by a customer in a day.
2. Technical Metadata: This metadata stores information about the technical aspects of the data for objects like tables, columns, data types, data profiling, ETL objects and so on. This is typically called the data dictionary (or System Inventory).
An example of data dictionary metadata would be Microsoft SQL Server's metadata storage. SQL Server keeps server-wide metadata in a set of system databases – 'master', 'model' and 'msdb' – while each database stores data about its own user objects (which include tables, indexes, constraints, stored procedures, functions, views, etc.) in a set of internal system tables. Ex: sysobjects, syscolumns, sysconstraints and other tables brilliantly exposed by the INFORMATION_SCHEMA views (a quick query against these views appears right after this list). Note: These system tables also store data about themselves.
3. Process Execution Metadata: Data about the actual execution of the ETL processes – performance statistics, rows transferred, transfer duration, errors\exceptions encountered, logging, etc. In a way this is also technical metadata, but I like to branch it into a separate category due to the nature of the data. (A sketch of a minimal execution log follows below.)
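To make the data dictionary concrete, here is a quick query against the standard INFORMATION_SCHEMA views that lists every user table's columns, data types and lengths in the current database (a minimal sketch, nothing project-specific):

-- List the data dictionary for all user tables in the current database
SELECT  c.TABLE_NAME,
        c.COLUMN_NAME,
        c.DATA_TYPE,
        c.CHARACTER_MAXIMUM_LENGTH,
        c.IS_NULLABLE
FROM    INFORMATION_SCHEMA.COLUMNS AS c
        JOIN INFORMATION_SCHEMA.TABLES AS t
          ON  t.TABLE_SCHEMA = c.TABLE_SCHEMA
          AND t.TABLE_NAME   = c.TABLE_NAME
WHERE   t.TABLE_TYPE = 'BASE TABLE'   -- user tables only, skip views
ORDER BY c.TABLE_NAME, c.ORDINAL_POSITION;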
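And for the process execution metadata, the simplest form of an execution log might look like the sketch below; the table and column names are illustrative assumptions on my part, not a prescribed design:

-- Illustrative process execution log (all names are hypothetical)
CREATE TABLE dbo.ProcessExecutionLog
(
    ExecutionId     INT IDENTITY(1,1) PRIMARY KEY,
    PackageName     NVARCHAR(256) NOT NULL,   -- SSIS package that ran
    StartTime       DATETIME      NOT NULL,
    EndTime         DATETIME      NULL,
    RowsTransferred INT           NULL,       -- rows moved in this run
    Status          VARCHAR(20)   NOT NULL,   -- 'Running', 'Succeeded', 'Failed'
    ErrorMessage    NVARCHAR(MAX) NULL        -- populated only on failure
);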
The Why?
The ‘why’ of this blog is to (try to) understand how the purpose and use of the metadata stored in a metadata repository impact its design. I will also discuss the reasons for aligning your repository design with the purpose of the metadata. Just to be clear, I myself am a big supporter of introducing metadata into the overall architecture of a system, and to an extent of metadata driven architectures. The buck stops at the point of realization that the effort being invested in “making it work” is more than the “actual work” itself.
Now that we have a good enough understanding of what metadata is and its different forms of storage, let's move on to a case of building a metadata solution for a database migration project.
The (Client’s) Treasure
Let us consider the case of designing a new enterprise data warehouse project, with a metadata solution as an integral part of the effort. (I am relating this to a past project at a major automobile dealership chain, where approximately one thousand DB2 tables needed to be transferred to a SQL Server database; I am sharing some of that experience here.) Situation: create a daily load from the DB2 tables into the new SQL Server data warehouse. Assuming that our solution has an ETL part (SSIS) as well as a reporting and analysis piece (SSAS and SSRS), we come up with a metadata repository (a common SQL Server database) holding a logging and auditing framework for the ETL components (process metadata) and a key-value pair storage structure for column definitions, ETL object definitions, report definitions, etc. (a common structure for the business and technical metadata identified above); a sketch of such a structure follows below.
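To give a feel for that key-value pair structure, here is a minimal sketch; the table and column names are illustrative assumptions, not the actual project schema:

-- Illustrative key-value storage for business and technical definitions
CREATE TABLE dbo.MetadataRepository
(
    MetadataId    INT IDENTITY(1,1) PRIMARY KEY,
    ObjectType    VARCHAR(50)   NOT NULL,  -- 'Column', 'ETL Object', 'Report', ...
    ObjectName    NVARCHAR(256) NOT NULL,  -- e.g. 'Orders.Quantity'
    MetadataKey   VARCHAR(100)  NOT NULL,  -- e.g. 'Definition', 'SourceDataType'
    MetadataValue NVARCHAR(MAX) NULL
);

-- A business definition stored as a key-value pair
INSERT INTO dbo.MetadataRepository (ObjectType, ObjectName, MetadataKey, MetadataValue)
VALUES ('Column', 'Orders.Quantity', 'Definition',
        'Total number of orders placed by a customer in a day');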
The Good
From the descriptions of the types of metadata being stored, the purpose can be inferred: operational statistics\reporting for the scheduled jobs and ETL processes, and business definitions of columns to serve as a data dictionary for years to come.
As long as the purpose and the use (at the most granular level) are specifically understood, the pursuit of that type of metadata storage can be justified.
Now let's assume the creation of a raw or staging area for our warehouse, where the table structures are going to be almost the same as those of the source. Given that we have the metadata of the source system (table name, column name, data type, length, etc.), we should be able to create a similar structure in our SQL Server warehouse environment by writing some looping programming constructs that create all the staging tables dynamically.
Wow, isn't that awesome? Brag time: I was able to create a ‘simple’ script that generated a staging area with a whopping 1000 tables plus their keys. Doing it manually is just not feasible. Now compare the effort involved in creating the script\process that did the work (less than 4 hrs.) against manually creating each table, which by any estimate costs a lot of $$$ and is not being smart (“making it work” vs. “actual work”).
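For the curious, the script was conceptually along the lines of the sketch below. This is a heavily simplified illustration, not the actual script: it assumes the DB2 metadata dump has been loaded into a table named dbo.SourceColumns (TableName, ColumnName, DataType, Length, OrdinalPosition) and that a stg schema exists, and a real DB2-to-SQL Server type mapping needs more care than shown here.

-- Generate and run one CREATE TABLE statement per source table
-- (dbo.SourceColumns and the stg schema are assumed for illustration)
DECLARE @tableName SYSNAME, @sql NVARCHAR(MAX);

DECLARE tableCursor CURSOR FOR
    SELECT DISTINCT TableName FROM dbo.SourceColumns;

OPEN tableCursor;
FETCH NEXT FROM tableCursor INTO @tableName;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Stitch the column list together from the metadata rows
    SELECT @sql = 'CREATE TABLE stg.' + QUOTENAME(@tableName) + ' (' +
        STUFF((SELECT ', ' + QUOTENAME(ColumnName) + ' ' + DataType +
                      CASE WHEN DataType IN ('char', 'varchar')
                           THEN '(' + CAST(Length AS VARCHAR(10)) + ')'
                           ELSE '' END
               FROM dbo.SourceColumns
               WHERE TableName = @tableName
               ORDER BY OrdinalPosition
               FOR XML PATH('')), 1, 2, '') + ');';

    EXEC sp_executesql @sql;   -- create the staging table

    FETCH NEXT FROM tableCursor INTO @tableName;
END

CLOSE tableCursor;
DEALLOCATE tableCursor;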
The (Not so) Bad 
The maintenance and constant upkeep of metadata is the ‘not so bad’. Although it does take time and effort to keep it up to date, doing so keeps the metadata in check. There will be the occasional exception that throws the metadata values out of whack and may cause your automated process to fail. Try to fix the metadata, not the process, in this case. Example: some of the data types in the DB2 metadata dump are mistyped as ‘datetimer’ instead of ‘datetime’ (the extra ‘r’ at the end causing the automated script to fail). This is a very simple example; think of other instances where you are creating primary keys and foreign keys from the metadata. Sure, that does seem like a good idea, and it indeed is if the metadata is in a structured format. And yes, the process can also be fixed or restructured to address this; but a well defined and well-known process already exists for something like this, and it is called ‘Data Modeling’. This gets into the ‘Ugly’ part discussed below.
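One cheap way to keep the metadata in check is to validate it before the automated process consumes it. For example, a quick test (again assuming the dbo.SourceColumns dump table from the earlier sketch, with data types already expressed in SQL Server terms) would have caught the ‘datetimer’ typo before the script failed:

-- Flag metadata rows whose data type SQL Server does not recognize,
-- e.g. the mistyped 'datetimer' from the DB2 dump
SELECT sc.TableName, sc.ColumnName, sc.DataType
FROM   dbo.SourceColumns AS sc
WHERE  NOT EXISTS (SELECT 1
                   FROM   sys.types AS t
                   WHERE  t.name = sc.DataType);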
Remember the purpose of the metadata store – it was just to serve as a data dictionary in its simplest form. The ability to do something extra with it was a by-product.
The (Definitely) Ugly
The ability to script and generate the staging area was possible only due to the well structured metadata from the DB2 system. If it were not, the effort would be different – spending countless hours to get the DB2 metadata into a “perfect” structure could easily take more time than the actual work of just creating the staging area manually, one table at a time. The “perfect” structure would mimic the highly normalized storage structures in which SQL Server keeps its own metadata, validating all the data being stored. This takes you on a quest for a complex solution.
Your metadata repository structure does not have to be perfect; it just needs to serve its purpose. If you end up going down the path of making it absolutely right, you are just over-engineering something simple.
The Summary
Remember that your metadata does not have to be a panacea (it is not, so try not to push too hard toward making it one). Define the purpose of each piece of metadata, consider the administration\maintenance effort that will be spent on it, and just drive toward achieving that purpose. There is a thin line between the bad and the ugly. As a continuous practice, keep your metadata solution in the ‘good’ or the ‘not so bad’ band and it will have met its purpose.
The metadata you pursue will give you a lot of information, and every time you work with the metadata repository you will get newer ideas on using it for any number of things (possibly even fueling a rocket; watch out for such instances and prevent yourself from designing a rocket fueling system).
Do NOT let the metadata implementation & management overshadow the actual system implementation and your immediate goals.
That is all I have for now. Thanks for reading.
Fun fact: The total number of times the word ‘metadata’ appears in this article = 50 :-)