Saturday, March 12, 2011

Heuristics and why you need to know they can happen!

Imagine you walk into a bank and want to perform a transaction (banks are very useful things in transaction examples). That transaction involves you transferring money from one account (current) to another (savings). You obviously want this to happen with some kind of guarantee, so for the sake of this example let's assume we use an ACID transaction.

To ensure atomicity between multiple participants, a two-phase commit mechanism is required. During the first (preparation) phase, each participant must make durable any state changes that occurred within the scope of the atomic transaction, such that these changes can either be rolled back (undone) or committed later, once consensus on the transaction outcome has been reached amongst all participants. In other words, the original state must not be lost at this point, as the atomic transaction could still roll back. If a failure occurs during the first phase, all participants are forced to undo their changes. Otherwise, in the second (commitment) phase participants may “overwrite” the original state with the state made durable during the first phase.
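
To make the two phases a bit more concrete, here is a minimal sketch in Python. It is a toy, not how any real transaction manager is written: the `Account` class, its `pending` field and the `transfer` coordinator function are all names I've made up for illustration, and "durable" here just means "kept in memory alongside the original balance".

```python
class Account:
    """Toy participant: a committed balance plus a prepared (pending) one."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance   # original, committed state
        self.pending = None      # state recorded during phase one

    def prepare(self, delta):
        # Phase one: record the new state without losing the original.
        if self.balance + delta < 0:
            return False         # vote to roll back (insufficient funds)
        self.pending = self.balance + delta
        return True              # vote to commit

    def commit(self):
        # Phase two: overwrite the original state with the prepared state.
        self.balance = self.pending
        self.pending = None

    def rollback(self):
        # Discard the prepared state; the original balance is untouched.
        self.pending = None


def transfer(source, target, amount):
    """Hypothetical coordinator: True if committed, False if rolled back."""
    work = [(source, -amount), (target, amount)]
    prepared = []
    for account, delta in work:
        if account.prepare(delta):
            prepared.append(account)
        else:
            # Any "no" vote in phase one undoes everyone prepared so far.
            for p in prepared:
                p.rollback()
            return False
    for account in prepared:
        account.commit()
    return True


current = Account("current", 100)
savings = Account("savings", 50)
print(transfer(current, savings, 30), current.balance, savings.balance)   # True 70 80
print(transfer(current, savings, 500), current.balance, savings.balance)  # False 70 80
```

The point to notice is that `prepare` never touches `balance`, which is why a rollback after phase one is always possible.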

In order to guarantee atomicity, the two-phase protocol is necessarily blocking. If the coordinator fails, for example, any prepared participants must remain in their prepared state until they hear the transaction outcome from the coordinator. All commercial transaction systems incorporate a failure recovery component that ensures transactions that were active when a failure occurred are eventually completed. However, in order for recovery to occur, machines and processes obviously need to recover! In addition, even if recovery does happen, the time it takes can be arbitrarily long.

So, in our bank example, despite the fact that we're using transactions and assuming that the transaction system is reliable, certain failures will always occur given enough time. The kinds of failure we're interested in for this example are those that occur after the participants in the two-phase commit transaction have said they will do the work requested of them (transfer the money), i.e., during the second (commit) phase. So, the money has been moved out of the current account (it's really gone) and is being added to the savings account when the disk hosting the savings account dies, as shown in the diagram. Usually what this means is that we have a non-atomic outcome, or a heuristic outcome: the transaction coordinator has said commit, one participant (current account) has said “done”, but the second one (savings account) has said “oops!”. There's no going back on the work the current account participant has done, so this transaction isn't going to be atomic (all or nothing).
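
Here is a rough Python sketch of that failure, just to show where the non-atomic outcome comes from. Everything in it is an assumption for the sake of the example: the `DiskFailure` exception stands in for the dead disk, and `Account` and `commit_phase` are hypothetical names, not part of any real transaction system.

```python
class DiskFailure(Exception):
    """Hypothetical error standing in for the disk dying mid-commit."""


class Account:
    def __init__(self, name, balance, disk_ok=True):
        self.name = name
        self.balance = balance
        self.pending = None
        self.disk_ok = disk_ok

    def prepare(self, delta):
        # Phase one succeeds for everyone in this scenario.
        self.pending = self.balance + delta
        return True

    def commit(self):
        # Phase two: the failure we care about happens here.
        if not self.disk_ok:
            raise DiskFailure(self.name)
        self.balance = self.pending
        self.pending = None


def commit_phase(participants):
    """Run phase two; return the names that committed and those that failed."""
    committed, failed = [], []
    for p in participants:
        try:
            p.commit()
            committed.append(p.name)
        except DiskFailure:
            failed.append(p.name)
    return committed, failed


current = Account("current", 100)
savings = Account("savings", 50, disk_ok=False)
current.prepare(-30)
savings.prepare(30)

committed, failed = commit_phase([current, savings])
print(committed, failed)                 # ['current'] ['savings']
print(current.balance, savings.balance)  # 70 50
```

The 30 has left the current account and never arrived in savings, and by this point the coordinator can't ask the current account to undo anything.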

Imagine that this error happens and you don't know about it! Or at least don't know about it until the next time you check your account. Not good. Personally, I'd like to know if there's been an error as soon as possible. In our bank scenario, I can go and talk to someone in the branch. If I were doing this via the internet, there's usually a number I can call to talk to someone.

Fortunately, most enterprise transaction specifications, such as the OMG’s Object Transaction Service and the X/Open XA specification, and implementations such as JBossTS, allow for heuristics to occur. This basically means that the transaction system can be informed (and hence can inform) that such an error has happened. There's not a lot that can be done automatically to fix these types of error. They often require semantic information about the application in order to restore consistency, so they have to be handled by a system administrator. However, the important thing is that someone knows there's been a problem.
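
As a rough illustration of "someone knows there's been a problem", here is a small Python sketch in which the coordinator refuses to report success when the participants disagree. The `HeuristicOutcome` exception and the `finish` function are made up for this example; real systems surface the same idea in their own ways (Java's JTA, for instance, has a `HeuristicMixedException`).

```python
class HeuristicOutcome(Exception):
    """Toy exception carrying what an administrator needs to sort things out."""

    def __init__(self, committed, failed):
        super().__init__(f"committed: {committed}; failed: {failed}")
        self.committed = committed
        self.failed = failed


def finish(results):
    """results maps participant name -> True if its phase-two commit succeeded."""
    committed = sorted(name for name, ok in results.items() if ok)
    failed = sorted(name for name, ok in results.items() if not ok)
    if failed:
        # Don't pretend the transaction was atomic: tell the caller who did what.
        raise HeuristicOutcome(committed, failed)
    return "committed"


try:
    finish({"current": True, "savings": False})
except HeuristicOutcome as err:
    # In a real deployment this would be logged and flagged for manual resolution.
    print("needs manual resolution:", err)
```

Nothing here repairs the accounts. All it does is make sure the mismatch is raised to the caller rather than silently swallowed.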
