Problem Detection, Recovery, and Prevention…

May 05 2009

Category: Software Development — Phil @ 7:51 pm

I have spent a lot of time discussing software principles and architecture over the last couple of years, and the subject of error handling and recovery always seems to rear its ugly head. Most recently, in talking about moving from monolithic application deployments to a distributed, web service orientation, many developers seem to immediately go into paranoia mode. They see only two choices for working with distributed systems: the two-phase commit, or the “store everything” approach, in which each request is persisted so that it can be resubmitted when the service is available again. There is one other option that we cannot forget about, used by many applications regardless of their deployment: the traditional “logging, debug, retry” logic. Unfortunately, we developers like to add this code at the end of the development cycle, usually as an afterthought; but more on that later!

Let’s talk reality… When was the last time one of your systems actually failed due to some type of hardware or service failure? I honestly don’t recall a single failure. Considering today’s redundant hardware technologies, I believe it would be pretty hard to have a total system failure, provided we move away from those monolithic application deployments! I agree that some kind of isolation layer (protection) is required for systems outside of our operational control, such as a remote web service. But our internal network and applications should be a different discussion, one that assumes 100% availability. I realize that some systems can experience performance degradation, but that is a blog for another day, not something that we should be “protecting” ourselves from. I believe that most of the bugs we encounter in the production environment are actually caused by incorrect or misunderstood requirements, plus the occasional developer contribution (bad code, insufficient testing, etc.), not by a system failure. Another interesting finding: many of the bugs are actually found in the logging and monitoring (prevention) code, rather than in the true business logic. What or who are we actually protecting ourselves from?

One of our top goals should be to produce the smallest, simplest amount of code possible to implement the solution. I feel there are two (2) fundamental problem types:

Show Stoppers. A show stopper is the failure of a dependency without which your system cannot perform its primary function. For example, if you created a purchasing system that relied on an external pricing service, your system could not function without live prices. Assuming that this pricing service is internal to your company, there is no need to protect your service/application from this scenario. The actual risk of this type of failure is not worth the amount of effort required to implement and maintain a possible fall-back solution.

Recoverable errors. Recoverable errors are software problems that can be captured and dealt with later, at an appropriate time. Examples might include a failed database update or the failure to update a logging service. If the transaction can still be completed without the end user being impacted, this is an acceptable approach. The assumption is that we don’t have tens or hundreds of these issues to deal with at once, even though they can and will happen over time. We just need to ensure that we have a method for recovering from the problems, so that the system can be put back into a proper state of integrity.
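To make the two problem types a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the place_order function, the get_price and write_audit_log callables, and the RecoveryQueue are names I made up, not part of any real system. The pricing call has no fall-back at all (a show stopper simply propagates), while the audit log failure is captured so it can be dealt with later.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecoveryItem:
    """One captured failure, kept so the step can be replayed later."""
    step: str
    payload: dict
    error: str


@dataclass
class RecoveryQueue:
    """Hypothetical in-memory store; a real one would be durable."""
    items: List[RecoveryItem] = field(default_factory=list)

    def capture(self, step: str, payload: dict, exc: Exception) -> None:
        self.items.append(RecoveryItem(step, payload, repr(exc)))


def place_order(
    order: dict,
    get_price: Callable[[str], float],
    write_audit_log: Callable[[dict], None],
    recovery: RecoveryQueue,
) -> float:
    # Show stopper: without a live price there is nothing useful to do,
    # so there is no fall-back code here. Any failure simply propagates.
    price = get_price(order["sku"])
    total = price * order["quantity"]

    # Recoverable error: the order can still complete, so capture the
    # failure for later recovery instead of impacting the end user.
    entry = {"sku": order["sku"], "total": total}
    try:
        write_audit_log(entry)
    except Exception as exc:
        recovery.capture("write_audit_log", entry, exc)

    return total
```

The point of the sketch is how little code the show stopper path needs, and that the recoverable path records enough (step name and payload) to put the system back into a consistent state later.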

I think the point of this post is to ensure that “software issues” can be detected. If we can assume that collaborating internal systems are always available, we can eliminate most of the code that manages these dependencies. We should instead focus on a common design/implementation/strategy that ensures the detection of issues. Looking across existing applications, it seems that each one has its own unique way of capturing and monitoring problems, with each team spending significant design and implementation effort attempting to create a robust method. Ironically, with all of the work developers do, many of these systems have problems that go undetected for hours, days and even weeks, without anyone ever knowing that an “issue” happened. It appears that we should spend more effort on a common monitoring strategy to detect problems; one that makes us aware of problems within minutes of their occurrence, not hours.
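As a rough illustration of what a common detection strategy could look like, here is a small Python sketch. It is only an assumption on my part, not a description of any existing tool: the IssueMonitor class, its method names, and the five minute threshold are all hypothetical. Every application reports issues to the same monitor in the same way, and the monitor can tell us which issues nobody has looked at within the time limit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Issue:
    source: str                      # which application reported it
    message: str
    detected_at: datetime
    acknowledged_at: Optional[datetime] = None


class IssueMonitor:
    """Hypothetical shared monitor used by every application."""

    def __init__(self, max_unacknowledged: timedelta = timedelta(minutes=5)):
        # Assumed threshold: issues should be noticed within minutes.
        self.max_unacknowledged = max_unacknowledged
        self.issues: List[Issue] = []

    def report(self, source: str, message: str, now: datetime) -> Issue:
        issue = Issue(source, message, now)
        self.issues.append(issue)
        return issue

    def acknowledge(self, issue: Issue, now: datetime) -> None:
        issue.acknowledged_at = now

    def overdue(self, now: datetime) -> List[Issue]:
        # Issues that nobody has acknowledged within the allowed window.
        return [
            i for i in self.issues
            if i.acknowledged_at is None
            and now - i.detected_at > self.max_unacknowledged
        ]
```

The timestamps are passed in rather than read from the clock, which keeps the sketch easy to test; a real monitor would also need to notify someone when overdue returns anything.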

The second part of this approach is the ability to recover. There are probably some other important concepts in my head, not part of this blog, that actually make this all possible. I think that for a system to be recoverable, it should be compartmentalized: for example, smaller databases, each covering its own functional domain area, rather than a single, gigantic database with full referential integrity. By breaking the system (services and database) into smaller, self-contained “mini” environments, each detected failure has a smaller overall impact on the system and is ultimately easier to recover from, as the impact is isolated to a distinct area of the system.

Hopefully, the ability to detect problems earlier, minimize their impact, and simplify the recovery process will also make the prevention of problems easier to address. With smaller, simpler systems (services), it will be easier to locate and correct the offending code (because there is less code and data coupling), resulting in less risky changes that are easier to test and validate. Sounds like there is an overall win-win architecture in here somewhere… I sure hope I get to build a system with these principles and goals some day!!!!!