Wednesday, October 6, 2010

Agile Data Warehouse Planning & Implementation with Hudson, NANT, Subversion and Visual Studio Database Projects (Part 1 of 4)

Overview
Managing data warehouse projects with continuous integration and open source technologies is an uncommon practice, or I guess just an unpopular one, in IT shops dealing with database code, SSIS and SSAS projects (from my experience). The excuses\opinions differed from company to company:

• “It doesn’t apply to database code projects”
• “What is Continuous Integration and how does it apply to data warehouse projects?”
• “Here at Acme Inc. change control is done by our architect\DBA who uses a tool called ‘AcmeErwin’ or AcmeVisio to generate code, so we don’t need the additional bells and whistles”
• “Automating testing & deployment for database projects, SSIS packages is not possible”
• “Is it worth the effort?”
• “We are special, we do things differently & we don’t like you.” – kidding about this one.

In this article I will try to justify the use of CI on data warehouse projects and address the concerns above. The subject matter is geared towards planning and implementing data warehouse projects with agile development practices, along the lines of iterative feature\perspective driven development (Perspective = Subject Area = Star Schema). The article begins with an introduction to agile development practices, reviews evolutionary database design, defines continuous integration in the context of database development, compares the viewpoints of the waterfall and JAD methodologies with Agile, and demonstrates how coupling the Kimball approach with Agile establishes a framework for planning long term project milestones made up of short term visible deliverables for a data mart\warehouse project. A detailed walkthrough of setting up a sample database project with these technologies (VS Database Projects for managing code, Hudson for continuous database integration, NANT for configuring builds, Subversion for source control and OSQL for executing command line SQL) is included for demonstration.

Due to the verbose nature of my take on this, I am writing this article as a four-part series. Trust me, the next three parts are going to be hands-on cool stuff.
Part 1 – An introduction to agile data warehouse planning & development and an introduction to Continuous Database Integration (CDBI).
Part 2 – Create the database project with Visual Studio Database Projects & Subversion.
Part 3 – Prepare the build machine\environment with Hudson and NANT.
Part 4 – Making the medley work. Proof is in the pudding.

Introduction
After shifting gears between different approaches to database development at various client sites and compiling the lessons learnt, I am close to applying a standardized methodology for database development and management. One can apply this approach to projects regardless of size and complexity, owing to its proven success.

Before starting the introduction to Continuous Integration and Agile, let me take a step back and share a lesson learnt working with the waterfall model. I was working on a new data warehouse project that adopted the waterfall SDLC approach for planning and implementation, and over time the implementation stopped following the estimated plan. Sure, it was only an estimate, but you don’t want these estimates changing forever. Here the implementation was almost always off track when compared to the initial plan. The mismatch was not visible during the planning phase, but it became more and more evident during the development cycle. The mismatch was due to ‘change’: changes in requirements, or changes caused by external factors. When you choose a data warehouse development methodology it is either the Kimball approach or the Inmon approach, and my project started off with waterfall + Kimball. But because the requirements from the business changed so frequently, the waterfall was proving to be a showstopper. The time the development team needed to react to each change and turn it around was slowing down the project timeline. On top of the changing requirements, this project was already a mammoth effort, with more than ten subject areas sharing conformed dimensions to form a data mart.

The old school approach to starting a new database project (with the waterfall cycle) begins with the initial requirements, then comes the logical model and then the physical model. Usually in this approach you have an ‘architect’, a ‘data modeler’ or an application DBA on the project who owns the schema and is responsible for making changes from inception to maturity. This methodology is almost perfect, and everyone starts posting those ER diagrams on their walls and showing off, until Wham! Requirements start changing, and for every change you need to go back and change the specification, the schema and then the actual code behind it, and of course allow time for testing. As the frequency of these changes goes up, the delivery dates get pushed out and your project plan changes. In this approach, the turnaround time for delivering an end product with the identified changes is just not feasible. I am not debating whether using a waterfall approach determines success or failure; I am trying to juxtapose the effort involved (basically showing you the time spent in the spiral turnover of the waterfall model, so that I can compare it with an agile approach). This is a classic example of how the traditional waterfall approach hinders the planning and implementation of your project.

This called for a change in the development approach, to one that could react quickly to any change that could affect the timeline of deliverables. The new approach adopted was an agile development practice + the Kimball method, which resulted in a successful implementation of a large scale data mart for a health care company. By now you should have an idea of what I am trying to sell here.

What is Continuous Integration?
Continuous Integration is all about automating the activities involved in releasing a feature\component of software and being able to run that process in a repeatable manner, which reduces manual intervention and thus improves the quality of the product being built. This set of repeatable steps typically involves running builds (compiling source code), unit testing, integration testing, static code analysis, deploying code, analyzing code metrics (quality of code, frequency of errors) etc.
Continuous Integration for database development is the ability to build a database project (a set of files that make up your database) in a repeatable manner, such that the repeatable action mimics the deployment of your database code. Depending on the database development structure, database projects can successfully be set up to run scheduled builds, automate unit testing, start jobs and deploy code to different environments (or stage it for deployment), reducing the human error that a continuous and monotonous process can bring.
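To make the idea of a repeatable database build a little more concrete, here is a minimal sketch in Python. It is not the Hudson\NANT setup covered in the later parts. It assumes SQLite (from the Python standard library) as a stand-in for SQL Server, and the script names, table names and the functions build_database, list_tables and run_build are all made up for illustration. The point is only that every run starts from an empty database, applies the project's scripts in a fixed order, and then checks the result.

```python
import sqlite3

# Hypothetical database project: script name -> SQL text.
# In a real project these would be files checked out from source control.
PROJECT_SCRIPTS = {
    "001_dim_product.sql": (
        "CREATE TABLE dim_product ("
        "product_key INTEGER PRIMARY KEY, name TEXT NOT NULL);"
    ),
    "002_fact_inventory.sql": (
        "CREATE TABLE fact_inventory ("
        "product_key INTEGER NOT NULL REFERENCES dim_product(product_key), "
        "quantity INTEGER NOT NULL);"
    ),
}


def build_database(scripts):
    """Deploy every script, in name order, to a brand-new in-memory database."""
    conn = sqlite3.connect(":memory:")
    for name in sorted(scripts):
        try:
            conn.executescript(scripts[name])
        except sqlite3.Error as exc:
            conn.close()
            # Fail the whole build and say which script broke it.
            raise RuntimeError(f"build failed in {name}: {exc}") from exc
    return conn


def list_tables(conn):
    """Return the names of the tables the build produced, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def run_build(scripts, expected_tables):
    """One CI run: build from scratch, then verify the expected tables exist."""
    conn = build_database(scripts)
    try:
        return list_tables(conn) == sorted(expected_tables)
    finally:
        conn.close()


if __name__ == "__main__":
    ok = run_build(PROJECT_SCRIPTS, ["dim_product", "fact_inventory"])
    print("build succeeded" if ok else "build verification failed")
```

Because nothing is carried over between runs, running the build a second time gives the same result as the first, and a broken script fails the build straight away instead of surfacing at deployment time.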

The CDBI (Continuous Database Integration) Environment
The best part about the tools I am going to use to set up the continuous integration environment is that they are almost all FREE. The one exception is Visual Studio Team Edition for Database Professionals, which I would highly recommend as a tool for developing and managing code when working with database projects. The other tools that I used for continuous integration are Hudson, Subversion and NANT. Yes, freeware used mostly by the other community. But after applying them side by side with MS technologies, they proved to be a good mix.

All of the opinions listed in the overview, when drilled down, point to the concept of ‘done’ or to ‘a deliverable’. The granularity of the deliverable is pivotal to incremental software development. It is just a matter of perspective – for incremental software development the granularity of your deliverable is much smaller and its visibility is much clearer when compared to a deliverable on the waterfall track. With continuous sprints you know for sure what needs to be delivered by the next sprint\iteration. At this point you have a definition of ‘done’. This is the most important thing when we start getting into agile development practices – the concept of done. [TechEd Thanks]
The more granular the tasks are, the easier they become to control and complete. Once you start slacking on a few, they tend to pile up, and when that happens on a larger scale they fog the plan even more. Once you step into the shoes of a project planner and also of a lead, this gap will become more evident.

This is where CI helps in meeting deadlines and showing progress of work in regular sprints, where the progress of the previous sprint is evaluated (to validate the concept of done) and the requirements for the next sprint are defined.

Summary
To summarize, Continuous Integration in a DB environment is all about developing your database code in sprints (of two weeks or more, your choice), by feature or perspective. For example, a feature in the AdventureWorks database would be the HR module or the Sales module. An example in a data warehouse environment could be the Inventory star schema. It is these short sprints (regular intervals of feature completion or deliverable, usually 2 weeks) with clear, quantifiable requirements (the definition of ‘done’) that help gauge the status of work. Once the developers adapt to this rapid SDLC, visibility into the progress of the work goes up, which results in accountability and ownership of work, a more cohesive team, increased productivity (I can bet on this one) and, most important of all, a quality product being delivered in chunks to form the big picture. The big picture is a collection of perspectives (star schemas) that plug together to form a data warehouse. That is all I have for now; more to follow in parts 2, 3 and 4 on setting up the CI environment with a database project, using Subversion as the source control system and NANT for creating build files.

Thanks for reading & stay tuned ….

Vishal Gamji