Scott Ambler and Pramod Sadalage wrote Refactoring Databases, they say, "to share their experiences and techniques at evolving database schemas via refactoring". The book, particularly in the thorough list of refactorings detailed in later chapters, reveals them to be experienced users of, and writers about, agile development approaches. Their core premise is that data and databases must evolve in the same way as code does: incrementally.
They argue persuasively that a big-bang, up-front approach to database design is unacceptable in a modern environment. There is simply too much change and too much uncertainty in the business world for this to be realistic. The basic techniques for evolutionary database design include refactoring (the topic of the book), evolving the data model, database regression testing, configuration management and developer sandboxes. This more evolutionary approach is going to be a big shock for many data professionals, something the authors note, but I think the need for effective evolution and ongoing development of applications, and thus their databases, is compelling. "Change time", the period after an application (or database) is first deployed, is by far the majority of the life of an application. Techniques that help you cope with this, like database refactoring, are a good thing. Database refactoring, as described in the book, is part of an evolutionary approach, and with development teams taking on Scrum, XP and other agile methods it is more important than ever for database teams to do likewise. Many data professionals will likely have the same knee-jerk reaction I did when first approaching this: why not just get it right up front? But if you believe that agile model-driven development is here to stay for code, then you have to accept the need for the same approach to database design.
Martin Fowler’s original "Refactoring: Improving the Design of Existing Code" book made the point that a refactoring must retain the behavioral semantics of the code, and this is just as true in databases. The authors take great pains to explain refactoring in enough detail that you can apply it constantly to keep the database as clean and easy to modify as possible. They emphasize repeatedly the value of test-driven or test-first development, even in database design and deployment, and they stress the importance of testing, especially regression testing, all the components that access a database when refactoring it. They advise making refactoring efforts small as well as test-driven: refactoring should be done as a series of small steps, and database developers must not succumb to the temptation to combine refactorings into larger, more complex efforts. The need to treat database designs, and even data, as artifacts subject to change control comes through loud and clear.
The concept of a potentially very long transition period in which both the old and new schemas are supported is a particularly good one. I worry about the organizational dynamics of having the old schema removed by a new team that does not remember the original refactoring but nothing else seems rational in a typical environment where many applications run against the same database. I also liked the paranoia of the authors, for instance in their suggestion to always run regression tests BEFORE refactoring to make sure the database is actually as you think it is!
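The "test before you refactor" advice can be illustrated with a minimal sketch. This is not code from the book: it assumes SQLite via Python's standard sqlite3 module, and the `customer` table, its columns and the helper names `column_names` and `check_schema` are hypothetical, chosen only for illustration. The idea is simply to confirm that the schema looks the way you believe it does before changing anything.

```python
import sqlite3


def column_names(conn, table):
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    # The table name is interpolated directly, so only pass trusted names.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def check_schema(conn, expected):
    """Return (table, expected_columns, actual_columns) for every mismatch."""
    problems = []
    for table, columns in expected.items():
        actual = column_names(conn, table)
        if actual != columns:
            problems.append((table, columns, actual))
    return problems


# Hypothetical starting schema, built in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, balance REAL)"
)

# What we believe the schema to be before refactoring.
expected = {"customer": ["id", "name", "balance"]}
problems = check_schema(conn, expected)
```

An empty `problems` list means the starting point matches expectations; anything else is a reason to stop and investigate before refactoring. A real regression suite would also exercise the data and the applications that use it, which this sketch does not attempt.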
While the book focused on refactoring, many of the hints and suggestions would be good for implementing real changes in business policy. For instance, one example used is the moving of a balance column from a customer to an account. This is a refactoring only as long as the rule is still one account per customer (the customer balance is set, as part of the refactoring, to be the same as their account balance). If you change your business model to allow multiple accounts for a customer then you must change the way a customer balance is calculated (among other things). While this is not a refactoring, the advice in the book on how to do refactorings would be very applicable and helpful.
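As a rough illustration of the balance example combined with a transition period, here is a sketch under stated assumptions rather than the book's own implementation. It uses SQLite through Python's sqlite3 module; the table layouts, the trigger names and the `balances` helper are all hypothetical. The one-account-per-customer rule is modeled with a UNIQUE constraint, and two triggers keep the old and new columns in step while both schemas are supported.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical starting schema: the balance lives on customer,
# and each customer has exactly one account (UNIQUE customer_id).
conn.executescript("""
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    name TEXT,
    balance REAL NOT NULL DEFAULT 0
);
CREATE TABLE account (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL UNIQUE REFERENCES customer(id)
);
INSERT INTO customer VALUES (1, 'Ann', 100.0), (2, 'Bob', 250.0);
INSERT INTO account VALUES (10, 1), (20, 2);
""")

# Step 1: introduce the new column alongside the old one.
conn.execute("ALTER TABLE account ADD COLUMN balance REAL NOT NULL DEFAULT 0")

# Step 2: copy the existing data across.
conn.execute("""
UPDATE account
SET balance = (SELECT balance FROM customer WHERE customer.id = account.customer_id)
""")

# Step 3: during the transition period, keep both columns synchronized.
# The "IS NOT" guard means the mirrored update touches no rows once the
# values already match, so the two triggers do not ping-pong.
conn.executescript("""
CREATE TRIGGER sync_customer_to_account AFTER UPDATE OF balance ON customer
BEGIN
    UPDATE account SET balance = NEW.balance
    WHERE customer_id = NEW.id AND balance IS NOT NEW.balance;
END;
CREATE TRIGGER sync_account_to_customer AFTER UPDATE OF balance ON account
BEGIN
    UPDATE customer SET balance = NEW.balance
    WHERE id = NEW.customer_id AND balance IS NOT NEW.balance;
END;
""")


def balances(conn):
    # Map customer id -> (customer.balance, account.balance) for comparison.
    rows = conn.execute("""
        SELECT c.id, c.balance, a.balance
        FROM customer c JOIN account a ON a.customer_id = c.id
    """)
    return {cid: (cust, acct) for cid, cust, acct in rows}


# An application still on the old schema writes to customer.balance...
conn.execute("UPDATE customer SET balance = 120 WHERE id = 1")
# ...while an updated application writes to account.balance.
conn.execute("UPDATE account SET balance = 300 WHERE customer_id = 2")
conn.commit()
```

After both writes, `balances(conn)` should report matching values in each pair. Once the transition period ends, the old column and the triggers would be dropped. If the business later allows several accounts per customer, the UNIQUE constraint and the simple mirroring no longer hold, which is exactly the point at which this stops being a refactoring.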
The book is a surprisingly easy read for such a potentially dense subject. It starts by describing the fundamentals of evolutionary database development and the basics of refactoring. A process overview, deployment notes and some best practices follow. These initial chapters, designed to be read in sequence, introduce and explain the topic well and each has a nice little "What you have learned" section at the end. There are many worthwhile asides as the book covers these topics, like the comment that a need to document often means a need to refactor, and another about making sure a refactoring is worthwhile before starting. After these introductory chapters, the book goes (somewhat abruptly) into a series of chapters on various kinds of refactoring: structural, data quality, referential integrity, architectural, method and transformations. These chapters take a series of very specific refactorings ("Introduce Calculated Column", "Move Data", "Introduce Hard Delete") and describe them. The potential motivation, trade-offs and implementation mechanics are defined for each. The refactorings are self-contained and, while this makes reading them as a set somewhat repetitive, it should make the book a much better reference guide for actual users. The refactorings (there are over 60) fall into various categories:
- Structural changes to tables or views to make them easier to work with or more closely matched to reality
- Data quality improvements
- Matching referential integrity checks to the business usage of the data
- Architectural changes to make data easier to access from outside programs
- Method or stored procedure changes that are basically code refactoring
There is also a set of related, non-refactoring transformations described in a similar style.
The book did not really touch on how you should consider data and database designs used in analytic models. The need to keep historical data in a way that allows potentially years of activity to be analyzed as a set seems like it should be a constraint on refactoring. The authors also missed the potential value of using business rules in complex conditional expressions or to replace database table lookups. In particular, some of the scenarios where logic is removed from the database and pushed to the applications cried out for the layering and inheritance of business rules.
These are minor quibbles. The book is well written, full of great examples and gives lots of good advice for both those trying to make analytics their core competitive advantage and those just trying to get a little more value from their data. You can buy the book here.