Sunday, September 23, 2012

SQL Baseline has joined the ChuckNorris Framework

I am very pleased to say that Rob and Dru have added my SQL Baseline tool to the Chuck Norris Framework. As part of SQL Baseline’s inauguration it has been renamed as PoweruP to fit alongside the likes of RoundhousE, DropkicK and WarmuP. The project has been moved over and can be found here.

I created PoweruP to help me configure RoundhousE to manage a number of existing databases. This is not an easy task and it can be a barrier which stops people trying out RoundhousE, as is shown by this conversation:


This is a shame because once RoundhousE is set up it greatly increases development speed, it is simple to maintain and it brings database development in line with application coding. What stops people using it is the need to extract all the stored procedures, views, functions, etc., from the database. With one command PoweruP will scaffold a new RoundhousE project from an existing database. It will also create the scripts and put them in the default RoundhousE folder structure. For a more detailed explanation see this post.

I am very pleased for PoweruP to be part of the Chuck Norris framework. I hope it will help more development teams to get started using RoundhousE because it is the best tool I have found for managing changes to the database schema.

Tuesday, September 18, 2012

Using 0MQ to communicate between threads

In this post I show how 0MQ can help with concurrency in a multithreaded program. To do this, I explore what concurrency means and why it is important. I then focus on in-process concurrency and threaded programming, a topic which is notoriously tricky to do well because of the need to share some kind of state between threads. I explore why this is and how it is typically tackled. I then show how communication between threads can be achieved without sharing any state, using 0MQ. Finally, I propose that constructing our multi-threaded applications using the 0MQ model leads to simpler, more succinct code.

All code can be found in this github project.

What is a concurrent program?

The word concurrent means more than one thing working together to achieve a common goal. In computing the work is usually one of two kinds: something which is computationally expensive, like encoding a video file, or something which is IO bound, like retrieving the size of a number of web pages.

The opportunity to employ concurrency has exploded with the arrival of multicore processors and the rise of hosted processing platforms like Amazon EC2 and Windows Azure. These two changes represent the two ends of the concurrency spectrum. To achieve concurrency on a multicore processor we create threads within our application and manage how they share state, whereas achieving concurrency with something like EC2 is network based and requires a communication channel such as TCP. When communicating over the network, state is handled by passing messages.

0MQ recognises that the best way to create a concurrent program is to pass messages and not to share state. Whether it is two threads running within a process or thousands of processes running across the internet, 0MQ uses the same model of sockets and messaging to create very stable and scalable applications.

Multiple threads, shared state and locks

In .NET any program that must do more than one task at a time must create a thread. Threads are the way Windows abstracts the management of many different streams of execution. Each thread gets its own stack and set of registers, and the OS schedules which thread runs at any one time.

The problem with threads is that the typical way for them to communicate with each other is to share some value in memory. This can cause data corruption because more than one thread could be accessing the data at the same time, so the application has to manage access to the shared data. This is done by locking the shared data, ensuring that only one thread can manipulate it at any one time. This mechanism adds complexity to the application, which must now include the locking logic, and it hurts performance, because threads block while they wait for the lock.
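For example, if several threads added file sizes to one shared total, every update to that total would need a lock. A minimal sketch (the class and member names here are made up for illustration):

    using System.IO;

    public static class DirectorySizer
    {
        // Shared state guarded by a lock: a thread must acquire _sync before touching
        // _totalBytes, otherwise two threads can read, add and write back at the same
        // time and lose an update.
        private static readonly object _sync = new object();
        private static long _totalBytes;

        public static void AddFileSize(string path)
        {
            long size = new FileInfo(path).Length;
            lock (_sync)
            {
                _totalBytes += size;
            }
        }

        public static long TotalBytes
        {
            get { lock (_sync) { return _totalBytes; } }
        }
    }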

0MQ: multiple threads and no shared state

0MQ makes threaded programming simpler by swapping shared state for messaging. To demonstrate this I have created a simple program which calculates the size of a directory by adding up the size of each file it contains.

As we are using 0MQ we have to understand some of the concepts it uses. The first concept is static and dynamic components. Static components are pieces of infrastructure that we can always expect to be there. They usually own an endpoint which can be bound to. Dynamic components come and go and generally connect to endpoints. The next concept is the types of sockets provided by 0MQ. The implementation we’ll be looking at uses two types of socket, PUSH and PULL. The PUSH socket is designed to distribute the work fairly to all connected clients, whilst the PULL socket collects results evenly from the workers. Using these socket types prevents one thread from being flooded with tasks or left idle waiting for its result to be taken.

Finally, the 0MQ guide has a number of patterns for composing an application, depending on the type of work being done. The example below calculates the size of a directory by getting the size of each file and adding them together. To achieve this in 0MQ, a good choice is the task ventilator pattern.

(Diagram: the task ventilator pattern, in which the Ventilator pushes tasks to a pool of Workers and the Workers push their results to the Sink.)

In the diagram each box is a component in our application and the components communicate with each other using 0MQ sockets. There are two static components in this application, the Ventilator and the Sink. There will only be one instance of each in the application and they will run on the same thread. There is one dynamic component, the Worker. There can be any number of workers and each one runs on its own thread.

To calculate the size of the directory, the Ventilator is given a list of files from the directory. It sends the name of each one out on its PUSH socket.
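A sketch of the Ventilator is below. I am assuming the clrzmq C# binding here; the socket class and method names differ between bindings and versions, so treat this as the shape of the code rather than the exact API.

    using System;
    using System.Text;
    using ZeroMQ;   // the clrzmq binding - an assumption, other bindings differ

    public class Ventilator : IDisposable
    {
        private readonly ZmqSocket _sender;

        public Ventilator(ZmqContext context)
        {
            // Static component: bind the PUSH socket that the workers connect to.
            _sender = context.CreateSocket(SocketType.PUSH);
            _sender.Bind("inproc://ventilator");
        }

        // Push the name of each file out to the connected workers.
        public void Distribute(string[] files)
        {
            foreach (string file in files)
            {
                _sender.Send(file, Encoding.UTF8);
            }
        }

        public void Dispose()
        {
            _sender.Dispose();
        }
    }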

When the Sink is started it is given the number of files to expect; in this instance we pass in the length of the array that we gave to the Ventilator. The Sink then pulls in the results from each of the workers and increments the running total for the size of the directory. When it has finished it returns the total size of the files found.
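A matching sketch of the Sink, under the same assumptions about the binding:

    using System;
    using System.Text;
    using ZeroMQ;

    public class Sink : IDisposable
    {
        private readonly ZmqSocket _receiver;

        public Sink(ZmqContext context)
        {
            // Static component: bind the PULL socket that the workers push their results to.
            _receiver = context.CreateSocket(SocketType.PULL);
            _receiver.Bind("inproc://sink");
        }

        // Pull one result per expected file and keep a running total of the sizes.
        public long CollectResults(int expectedResults)
        {
            long totalBytes = 0;
            for (int i = 0; i < expectedResults; i++)
            {
                string size = _receiver.Receive(Encoding.UTF8);
                totalBytes += long.Parse(size);
            }
            return totalBytes;
        }

        public void Dispose()
        {
            _receiver.Dispose();
        }
    }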

The Worker connects to the Ventilator and Sink endpoints and sits in an endless loop.

When a message arrives from the Ventilator it triggers an event which causes the Worker to read the file from disk to find its size. When the operation completes the Worker pushes the size to the Sink’s endpoint.
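A sketch of the Worker follows. The real code raised an event when a message arrived; to keep the sketch short I simply block on Receive inside the loop, which has the same effect:

    using System.IO;
    using System.Text;
    using ZeroMQ;

    public class Worker
    {
        private readonly ZmqContext _context;

        public Worker(ZmqContext context)
        {
            _context = context;
        }

        // Runs on its own thread: pull a file name, read its size from disk,
        // then push the result on to the Sink.
        public void Start()
        {
            using (ZmqSocket receiver = _context.CreateSocket(SocketType.PULL))
            using (ZmqSocket sender = _context.CreateSocket(SocketType.PUSH))
            {
                receiver.Connect("inproc://ventilator");
                sender.Connect("inproc://sink");

                while (true)
                {
                    string file = receiver.Receive(Encoding.UTF8);   // blocks until work arrives
                    long size = new FileInfo(file).Length;           // the only file IO in the program
                    sender.Send(size.ToString(), Encoding.UTF8);
                }
            }
        }
    }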

All the components are brought together in the controlling program. We create a 0MQ context which is shared with all the components. This is an important point when using 0MQ with threads: there must be a single context and it must be shared amongst all the threads. We then create instances of the Ventilator and Sink, passing in the context.

Next we create five workers, each on its own thread, again passing in the 0MQ context.

We do the work by building an array of files from our directory and passing this to the Ventilator. We tell the Sink how many results to expect and wait for the result to be returned.

When we have the final number we print it on the console. At no point in the process did any thread have to update a shared value.
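Putting it all together, the controlling program might look like this, still assuming the clrzmq binding; the directory path is illustrative:

    using System;
    using System.IO;
    using System.Threading;
    using ZeroMQ;

    public static class Program
    {
        public static void Main()
        {
            // One context shared by every component - the single most important rule when
            // using 0MQ across threads. It is not disposed here because the worker threads
            // keep their sockets open for the life of the process.
            ZmqContext context = ZmqContext.Create();

            using (var ventilator = new Ventilator(context))
            using (var sink = new Sink(context))
            {
                // Dynamic components: five workers, each on its own background thread.
                for (int i = 0; i < 5; i++)
                {
                    var worker = new Worker(context);
                    new Thread(worker.Start) { IsBackground = true }.Start();
                }

                string[] files = Directory.GetFiles(@"C:\SomeDirectory");   // illustrative path
                ventilator.Distribute(files);

                long totalBytes = sink.CollectResults(files.Length);
                Console.WriteLine("Directory size: {0} bytes", totalBytes);
            }
        }
    }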

Conclusion

In this post I investigated the programming challenges faced when dealing with concurrency, focusing on those specific to threaded concurrency. I have shown how 0MQ approaches this problem with the view that concurrency should never involve sharing state and that communication is best handled by passing messages. To demonstrate how this works I created a simple program to calculate the size of a directory and used the 0MQ task ventilator pattern to structure it. By following this pattern the software is broken down into very specific parts, each performing one job. All knowledge of how to read the size of a file is held in the Worker. If we discover a better way to read the size of a file, that component can be changed without any impact on the rest of the program. This isolation is a consequence of only allowing communication between the key components over a message channel, and it is why the code ends up simpler: each component does only one job.

All code can be found in this github project.

Sunday, August 19, 2012

0MQ Introduction

What is 0MQ?

0MQ is a very simple library for managing communication between different processes. It is a way of using enterprise messaging patterns without the need for an enterprise messaging server. Removing the server and working at the socket level strips away a layer of complexity, which leads to a simpler model that is well suited to concurrent programming.

History

0MQ has its roots firmly in the world of financial services. Originally there were two vendors, TIBCO and IBM, each with their own protocol for enterprise messaging. This made it hard for banks to intercommunicate. In 2003 the London office of JP Morgan created the first draft of the Advanced Message Queuing Protocol (AMQP), an attempt to create a standard communication protocol for messaging systems.

In 2005 iMatix were contracted by JP Morgan to create a message broker based on the new specification and they produced OpenAMQ. The new standard was well received by others in financial services and new members were added to the working group. However, the complexity of AMQP grew and this led to iMatix leaving the working group. In 2008 Pieter Hintjens of iMatix wrote What is wrong with AMQP and how to fix it. In it Hintjens applauds early versions of the specification for being concise and simple to implement but criticises later versions for their complexity. He argues that any specification that is too complex will fail. It is also clear that the experience iMatix gained developing OpenAMQ gave them good insight into a new way of supporting high speed messaging. It led them to conclude that the way to simplify messaging was to remove the server that hosted the queues for the clients. This thinking led to the development of 0MQ.

Not a message bus

In traditional enterprise messaging there is a server which hosts the queues and routes the messages. If you are an IBM shop this may be WebSphere MQ, a Microsoft shop would use MSMQ, whilst others may use RabbitMQ. All of these solutions involve software being installed on a server. Clients then bind to the queues it hosts to process messages.

0MQ is different in that it does not have a central server component; it is just a software library. For network communications you write the server and client components using the 0MQ API. Internally 0MQ uses TCP sockets to create the connection. For a lot of scenarios this removes a redundant step in the process. Take the example of a time server on the network whose job is to respond to requests for the current time. With an enterprise service bus my time server would bind to a queue on the central exchange. Any client that wanted to know the time would send a request to that queue and wait for a response. In this operation the central server is not adding much to the task. Using 0MQ the same server can be created very easily.
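A sketch of such a server is below, assuming the clrzmq C# binding (class and method names differ between bindings and versions):

    using System;
    using System.Text;
    using ZeroMQ;

    public static class TimeServerProgram
    {
        public static void Main()
        {
            using (ZmqContext context = ZmqContext.Create())
            using (ZmqSocket timeServer = context.CreateSocket(SocketType.REP))
            {
                // Bind to a TCP port and answer every request with the current time.
                timeServer.Bind("tcp://*:5555");

                while (true)
                {
                    timeServer.Receive(Encoding.UTF8);                            // wait for a request
                    timeServer.Send(DateTime.Now.ToString("o"), Encoding.UTF8);   // reply with the time
                }
            }
        }
    }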

The client that requests data from this service is just as simple.
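Under the same assumptions, a client sketch:

    using System;
    using System.Text;
    using ZeroMQ;

    public static class TimeClientProgram
    {
        public static void Main()
        {
            using (ZmqContext context = ZmqContext.Create())
            using (ZmqSocket client = context.CreateSocket(SocketType.REQ))
            {
                // Connect to the time server, send a request and wait for the reply.
                client.Connect("tcp://localhost:5555");
                client.Send("What is the time?", Encoding.UTF8);

                string time = client.Receive(Encoding.UTF8);
                Console.WriteLine("The time is {0}", time);
            }
        }
    }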

Applications that 0MQ is good for

By combining messaging patterns with socket-based communication, 0MQ is very good for concurrent programming. Concurrent programming can be across a network, within a machine or within a process. 0MQ uses the same patterns for all of these.

In the example above we created a server hosted on the network by binding to a TCP port: timeServer.Bind("tcp://*:5555");. To host the service within a machine or within a process we just change the binding type:
  • A way to connect processes on any machine and pass messages between them:
    Bind("tcp://*:5555");
  • A way to connect processes within a machine and pass messages between them:
    Bind("ipc://myservice");
  • A way to create threads in a process and pass messages between them:
    Bind("inproc://myservice");

Conclusion

0MQ is a very easy library to start using as it only involves including a couple of DLLs in your project and does not need any other infrastructure to support it. It has good abstractions and it is easy to build a variety of messaging patterns.

Where I think 0MQ is really powerful is when it is applied to multi-threaded programming. This is because 0MQ uses the same model for threaded programming as it uses for concurrent programming across networks. Both pass messages to communicate instead of sharing state, and this avoidance of shared state between threads leads to more reliable and simpler programs. I shall explain this fully in my next blog post.

Wednesday, May 16, 2012

Roundhouse with a legacy database

When putting an existing database schema into source control, often the biggest hurdle is how to start. In this post I will demonstrate how Roundhouse can be added to a normal Visual Studio solution. It can then be used to manage the database in the same place as you manage the code. This brings the benefits of modern development practices, like Test Driven Development and Continuous Integration, to database development.

I have created a fictitious scenario to help explain how to do this. I have been asked to create a simple interface for the Adventure Works database; apparently the old Visual Basic version is showing its age.

I've put all the code examples in a repository on github.

Structure

As I am starting a new project I have time to plan a good structure. I know that I will have source code, external tools and database backups for Roundhouse (more on these later). So I decide upon a tree along these lines.
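The exact folder names here are illustrative; the Tools and DatabaseBackups folders are referred to later in the post:

    AdventureWorks\                repository root
      DatabaseBackups\             production backups used by RoundhousE restore mode
      Tools\                       external tools, including the RoundhousE console runner
      src\
        AdventureWorks.sln
        AdventureWorks.Database\   the database project described below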



With the folder structure complete, I can create the solution and add a project for the database schema. Here I am using a class library and have named it AdventureWorks.Database. I have also added two folders, db and deployment, for the Roundhouse files.



Now to populate the db folder with the entities from the AdventureWorks database and put them into the relevant folders. My previous client had many legacy databases with thousands of entities, so I created sql-baseline to automate the process. Running sql-baseline requires three parameters: the server name, the database name and the place to put the scripts.
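Running it looks like this; the output path is illustrative and should point at the project's db folder:

    .\sqlbaseline.exe -s:"(local)" -d:AdventureWorks -o:C:\Projects\AdventureWorks\src\AdventureWorks.Database\db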

Sql-baseline has extracted the sprocs, views and functions and put them in the correct folders. The steps to add the files to Visual Studio are:
  • Click the “Show All Files” icon in Solution Explorer
  • Expand the db folder
  • Select the sprocs, views and functions folders
  • Right click the selected folders and select “Include in Project”

With the structure of the project set up, it is time to concentrate on working with Roundhouse.

The workflow

To support this workflow I have created a batch file, called LOCAL.DBDeployment.bat, to run Roundhouse. It lives in the deployment folder.

The batch file executes the Roundhouse command line runner, which is in the Tools folder, and sets the run mode to restore.

In a team environment this file would be omitted from source control, allowing each developer to modify the variables. When executed it restores the database using the backup in the DatabaseBackups folder at the top of the repository tree.
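A sketch of what LOCAL.DBDeployment.bat might contain is below. The rh.exe switch names are written from memory of the RoundhousE command line options and may differ between versions, so check rh.exe /? before relying on them; the paths assume the folder layout shown earlier:

    @echo off
    rem Variables each developer can change for their own machine
    set DATABASE=AdventureWorks
    set SERVER=(local)
    set REPO_ROOT=..\..\..

    rem Restore the production backup, then run the migration scripts in the db folder
    "%REPO_ROOT%\Tools\rh.exe" /d=%DATABASE% /s=%SERVER% /f=..\db /restore /rfp="%REPO_ROOT%\DatabaseBackups\AdventureWorks.bak" /silent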

To keep development as quick as possible I add an External Tool to Visual Studio with the following settings:
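The title is whatever you prefer and the initial directory is an assumption; the command points at the batch file:

    Title:              Run DB Migrations
    Command:            $(ProjectDir)\deployment\LOCAL.DBDeployment.bat
    Arguments:          (none)
    Initial directory:  $(ProjectDir)\deployment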

The full command text is $(ProjectDir)\deployment\LOCAL.DBDeployment.bat. Running the migrations is now a menu click. I can also map a keyboard shortcut to make it quicker.

Following these steps and conventions, Roundhouse is integrated into the development process. I have found that being able to run the process from Visual Studio keeps the focus on development and keeps context switching to a minimum.

Roundhouse modes

I have configured Roundhouse to run in what I call “Restore mode”. Roundhouse has three options for running the migration scripts:
  • Drop and Create - Roundhouse will drop the database and recreate it from the scripts
  • Restore - Roundhouse will restore the database from a backup and then run the scripts
  • Normal - Roundhouse will run the scripts against the database

Drop and Create is typically used when developing a new database where the schema is changing rapidly, but it is not a good choice for an existing schema. In that situation we need confidence that the deployment to production will work, and to achieve that we must develop against the current schema. This is why I use a backup of the production schema and have Roundhouse restore it before running the migration. In later posts I will explain how restore mode is used at every step in the deployment pipeline, which rigorously tests the changes. Normal mode is used when the changes are finally deployed to the production database. When this is complete a new backup is taken and this becomes the new baseline during development.

Conclusion

There are many benefits to adding the database to source control.

For the development process it keeps all of the work together. When I relied on SQL scripts I would work on a feature in Visual Studio by adding tests and refactoring the code. This would be checked in and the code for the feature would evolve. A side task would be to maintain the migration scripts, deploying them and testing that they worked. With the database in source control each commit includes changes to the whole system: the source code and the database schema. Anyone viewing the commit will see the complete set of changes, and for the person doing the work the development workflow is simpler. By including the database you bring the benefits of test driven development to the database changes and they become part of the “Red, Green, Refactor” TDD cycle.

For the deployment process it constantly tests the database migration scripts. Previously this was a manual task whereby each developer would test a single database script before the deployment. The script would be tested on a shared server holding a recent copy of the production schema. This method was prone to errors because another developer could have made a change to the production schema which the script was not tested against. By storing the changes in source control they are integrated during development, so every developer is aware of them instantly. This brings the benefits of Continuous Integration to database development.

In my next post I will explain how Roundhouse can be added to the Continuous Integration process.

Wednesday, February 08, 2012

sql-baseline: a bootstrapper for RoundhousE

Roundhouse is a great tool for putting the database into source control. When Roundhouse is used on a project, changes to the schema become an afterthought: every check-in updates the schema on the build server; every check-out updates the schema on all the developers’ PCs. When a deployment is run, the schema changes are applied automatically.

But the barrier to entry for controlling an existing database is high. Roundhouse needs all the stored procedures, views, etc., as files in a folder structure. Each script must be able to create or alter the entity. For teams supporting databases with thousands of procedures and views this work will stop them from evaluating Roundhouse. This is a shame as they are missing out on a superb tool. So I have created something to help get you started.

What does it do?

My current client falls into the category of having thousands of procedures and views which had to be scripted. So I created a tool to extract them from SQL Server.

https://github.com/chucknorris/powerup/

Sql-baseline is run with three options: the server name, the database name and the location for the files:

.\sqlbaseline.exe -s:"(local)" -d:AdventureWorks -o:C:\Db\Adventure

It will generate a script for all of the stored procedures and views and put them in the default Roundhouse folders:
\db
  \sprocs
  \views
Each script is created with the Create if not Exists / Alter template, which checks for the entity first and creates it if it is missing. The files are then ready to be run by Roundhouse.
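For a stored procedure the generated script follows this general shape; the procedure name and body here are made up, and the exact text sql-baseline emits may differ slightly:

    IF NOT EXISTS (SELECT * FROM sys.objects
                   WHERE object_id = OBJECT_ID(N'[dbo].[MyProcedure]') AND type = N'P')
    BEGIN
        -- Create a stub so the ALTER below always has something to change
        EXEC('CREATE PROCEDURE [dbo].[MyProcedure] AS SELECT 1')
    END
    GO

    ALTER PROCEDURE [dbo].[MyProcedure]
    AS
    BEGIN
        SELECT GETDATE()   -- the real body of the procedure goes here
    END
    GO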
I have a short list of updates which are coming soon:
  1. Export functions and indexes
  2. Create all of the default Roundhouse folders
  3. New option to specify which entities to create
  4. Accept a set of custom folders

Conclusion

When I decided to trial Roundhouse my first hurdle was extracting the procedures and other entities into the correct format. Using sql-baseline has enabled me to put three of my clients’ databases into source control. I hope it helps others to start using Roundhouse and controlling changes to the database.

Tuesday, February 07, 2012

Tools for migrating the database

In my last post, controlling the database, I showed that putting the database schema in source control brings an easier workflow for developers and more reliable deployments. In this post I will look at the tools available to automate the process. They fall into two types: schema diff tools and migration tools.

Schema diff tools

Schema diff tools work by comparing each object in a source database to those in a target database. From this comparison they produce a script which will update the target to match the source, including adding and dropping columns in tables. Two examples of schema diff tools are Red Gate SQL Compare and Microsoft’s Visual Studio Database project.

Visual Studio Database project

The Visual Studio Database project adds the database to Visual Studio, so the database schema lives in the same solution as the application code. Visual Studio has a schema diff tool which can import an existing database; when it is finished there will be a file in the project for every object in the schema. When a change is made to one of the objects the project can be deployed to a target database to bring it up to date. Visual Studio creates an in-memory representation of the database so it can support refactoring of database objects. This is clever and powerful. It provides developers with a lot of support and confidence when making changes to the schema.

I have worked on a project which used the Visual Studio Database project. The database was new and overall it worked well. The project made sure that all developers had a current copy of the schema and deployments were simpler. There are some downsides though.

Merge conflicts

As every table, key and index is represented by an individual file, each also needs an entry in the project file. This caused merge conflicts in the project file which were either very hard or impossible to resolve. The team’s solution was to put an exclusive lock on the project file so only one person could update it at a time.

Failed migrations

The second problem is due to the nature of a schema diff tool. It compares two schemas and produces a script to bring one in line with the other. As this script is machine generated it does not consider how to migrate the data from one schema to the other. For example, if a field changes from NULL to NOT NULL, what happens to the existing empty fields? When the Visual Studio Database project encounters this it stops the migration, and the onus is then on each developer to find a workaround. The project has a mechanism to avoid this: a series of scripts that are always run before or after a deployment. As these scripts are run every time, they can become very large and difficult to manage.

Cost

My final comment about the Visual Studio Database project is the cost. It is only available in the Premium and Ultimate editions. At the time of writing Amazon is selling Premium for £2,112 and Ultimate for £10,792.

Migration tools

Schema migration tools use a series of scripts to manage the changes to a database. When the migration is run against the target database it applies all the changes which have been created since the last run. These tools do not extract changes from the database. Instead the developer must write a script to migrate the schema. Examples include RoundhousE, Ruby on Rails migrations and Tarantino.

RoundhousE

RoundhousE is an open source migration tool. It does not integrate with Visual Studio or provide the refactoring abilities of the Visual Studio Database project. But then it takes a different approach which removes the need for the heavyweight tooling. How does it do this? RoundhousE is based on SQL scripts kept in a set of directories. The directories are split into two main groups: any time scripts and one time only scripts.

Any time scripts are procedures, views, functions and indexes. If the entity is missing from the database Roundhouse will run the script to create it. If the entity has changed it will run the script to update it.

One time scripts are what make RoundhousE a migration tool. They contain the SQL to change the schema for a particular iteration. Each file is prefixed with a number and they are run sequentially.
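On disk the two groups look something like this; only the folders used in this post are shown, and RoundhousE supports several more by default:

    \db
      \up          one time scripts, run once each in numerical order
      \sprocs      any time scripts for stored procedures
      \views       any time scripts for views
      \functions   any time scripts for functions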

Migrations

To demonstrate this I have a simple example application to track students. The first task is to make a screen which captures the student’s name. As part of this task I create a table, but instead of creating the table on my local instance of SQL Server I add this script to the RoundhousE “up” folder, which is the default location for the one time scripts:

File called 0001_Make_student_table.sql

CREATE TABLE Student (Id int NOT NULL, Name varchar(50))


The file name is important as RoundhousE will run the scripts in numerical order. Running RoundhousE against the local database executes the script and creates the table.

The task is complete and so the changes are committed to source control, triggering a new build. The build server also runs RoundhousE against its own instance of SQL Server before running the unit tests. RoundhousE runs through the scripts in the “up” folder and applies all those added since the last time it was run.

The next feature is to also capture the student’s email address. To meet the specification the student table needs a new column, so I add the following migration script:

File called 0002_Add_email_to_student_table.sql 


ALTER TABLE Student ADD email varchar(100) NULL

I check in and the same process runs, so the build server has also updated the Student table on its local database.

When another developer pulls the latest changes from source control, all they have to do is run Roundhouse; it will run the two new scripts and their schema is also up to date.

Deploying the changes follows the same process as the build server. The call to RoundhousE is added to the deploy script which updates the target schema. RoundhousE will also read a version number from a file and write this to the database. This is very useful, as it is possible to see that UAT is running version 1.0.44.1 while production is running 1.0.33.2.

Conclusion

Adding a tool to manage database changes makes development easier and deployments safer. Without it the developer has to remember to run any schema changes on the build server before committing the change, otherwise the tests will fail. Without a tool the person responsible for the deployment has to compile a large SQL script. As the various environments are managed by hand it may be that UAT is out of step with Production, which means the deployment script has to be tested on a copy of the target environment before it is run. These complications lead to day-long deployments. With a tool the changes to the database are all contained in the release package ready to be run. The tool will decide what needs to be updated since the last release. When a team works with this method for a while, releases become more frequent because they are less work and more reliable.

Tuesday, January 03, 2012

Continuous Integration for the Database

Managing changes to the database is often an overlooked problem but it is one of the biggest issues a team can face. The traditional model of development whereby a DBA is the key holder of the schema is a thing of the past for many teams. Now each developer is responsible for making schema changes. Many of them will be working on local copies of the database, which is a good thing. However, it can fall apart when the change leaves the developers and moves towards production.

In this post I will explore why this causes a problem and suggest a better solution.

The change script

Most changes to a database schema are handled using a simple script. The typical arrangement is a script which includes all the changes for a release. The developer will then apply this to each database involved in the deployment process. First to the build server, then to test and finally to production. When the release is finished the script is stored on a common share or maybe in source control.

This method does not work well. When Continuous Integration is used correctly, every commit to the build server represents a version of the system which could be released. The check-in should include all changes that are required to deploy that version to any environment. This means including changes to the code base and to the database. If the change relies on a script being run independently, then it is not atomic. If there is one thing that the DBMS has taught us, it is that all changes must be atomic.

Add to source control

The simple solution to this problem is to add the database schema to source control. This brings schema changes in line with changes to the code base. For a simple change this may just be a script that adds a column to a table. A more complex piece of work may involve a script which also migrates existing data.

Deploy with build

When using Continuous Integration, every check-in triggers a new build. The build server checks out the latest version, compiles it, runs the unit tests and reports back if the build passed or failed. In the shared scripts model the developer has to apply the changes to the build database before committing the changes. If not, then the build will fail. When the schema changes are part of the commit, applying them is managed by the build server.

So now the change is atomic. If the build succeeds it has verified that the code, schema changes, and the migration are error free. This build can now be deployed to the next step in the deployment pipeline. If the build fails the developer has some bugs to fix.

Deploy locally

One of the biggest problems with the shared script method for managing schema changes is keeping all developers’ local databases up to date. Often this is achieved by a change script being emailed around the team. The onus is then on each developer to run all the scripts to keep their database up to date. If they don’t, the next version of the build may not run on their machine.

When all schema migrations are part of the build process there is no onus. When a developer pulls the latest version from source control they just run an update process to apply all the schema changes. In fact it could be the same process which is used to deploy the code base to the different environments. This way every developer is also exposed to the process which runs the deployment.

Making the process work

Like any new process, it has to help the people using it: in this case, the people responsible for maintaining the database schema. In some organisations this may be two separate groups: developers and database administrators (DBAs).

Developers may not currently have to worry about applying their changes to production or UAT. They could be sending a script off to a DBA who then does the work for them. Asking them to be responsible for managing their changes to the schema is asking them to do more work.

DBAs are used to controlling the database schema and being the guardians of change. By automating the changes they lose some control and have to adapt to a new way of working. I would argue that the role of the DBA has changed though. The view that the application code base and the database schema are two separate entities is flawed and outdated. A change to the code is a change to the code, regardless of whether it is written in C# or SQL. Therefore the DBA must make their changes using the same mechanism as the developers: they have to check changes in to the source repository and monitor the state of the build.

For both groups it enforces more collaboration. Every check-in communicates to the whole team what someone is working on and how they have approached it. For people new to Continuous Integration this can be daunting as the notion of a failed build can be unsettling. So care must be taken to communicate the benefits. When I work with teams I point out that a failed build is not a bad thing. The team has discovered that this version will not work in production and now they can fix it.

Once the team has engaged with the process, one way to lose this positive effort is to have a build process which causes more problems than it solves. If the new way of working introduces more friction for the developers on a daily basis then it will be abandoned. If everyone has to spend an hour every day updating their local database then they will skip it. Because it has to work well and be simple to use, the choice of tool to manage the schema changes is important. The tools fall into two categories: schema diff tools and migration tools. Schema diff tools compare a source with a target schema; they find the differences and create a script which applies the changes from the source to the target. Migration tools run a series of scripts held on the file system, each encapsulating a single change. Once all the scripts have been applied to the target it is in line with the current version in source control.

In my next blog post I will explore the merits of both types of tool and describe my experiences with them.

Conclusion

Traditionally source control is the home of the application code, with the database being maintained separately. Historically this could even have been by a different team. The growth in popularity of Continuous Integration and testing has shown how flawed this approach is. It has become clear that changes to the database must be synchronised with changes to the code base. To achieve this the schema must be managed in source control.

This is the goal but it is not easy to achieve. Special tools must be used and the team must embrace a different way of working. The payoff, though, is fewer regression errors, quicker development and more confidence that changes will work when deployed to production.