Wednesday, April 20, 2011

Messaging/Database race conditions

Programmers are divided between those who only write single threaded code and those who will, at the slightest hint of provocation, tell interminably long war stories about debugging race conditions. Although I'm firmly in the second category, this is not such a story. It is, however, a lesson in what happens when spec authors spend insufficient time listening to that second group.

A common pattern in enterprise apps involves a business logic method writing a database update, then sending a message containing a key to that data via a queue to a second business method, which then reads the database and does further processing. Naturally this involves transactions, as it is necessary to ensure the database update(s) and message handling occur atomically.

methodA {
    beginTransaction();
    updateDatabase(key, values);
    sendMessage(queueName, key);
    commitTransaction();
}

methodB {
    beginTransaction();
    key = receiveMessage(queueName);
    values = readDatabase(key);
    doMoreStuff(values);
    commitTransaction();
}

So, like all good race condition war stories, this one starts out with a seemingly innocuous chunk of code. The problem is waiting to ambush us from deep within the transaction commit called by methodA.

Inside the transaction are two XAResources, one for the database and one for the messaging system. The transaction manager calls prepare on both, gets affirmative responses and writes its log. Then it calls commit on both, at which point the resource managers can end the transaction isolation and make the data/message visible to other processes.

Spot the problem yet?

Nothing in the XA or JTA specifications defines the order in which XAResources are processed. The commit calls to the database and message queue may be issued in either order, or even in parallel.

Therefore it is inevitable that every once in a while, probably on the least convenient occasions, the database read in methodB will fail as the data has not yet been released from the previous transaction.

Oops.
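
To make the ordering problem concrete, here is a minimal in-memory sketch. Everything in it is hypothetical: ToyResource, ToyDatabase and ToyQueue are toy stand-ins rather than real XAResources, and the toy database is assumed to show readers only committed data without blocking them. It commits the two resources in either order and lets methodB's receive and read run in between.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical stand-in for an XAResource: only the commit phase matters here.
interface ToyResource {
    void commit();
}

// Toy database: writes stay invisible to readers until commit() is called.
class ToyDatabase implements ToyResource {
    private final Map<String, String> committed = new HashMap<>();
    private final Map<String, String> pending = new HashMap<>();

    void update(String key, String value) { pending.put(key, value); }

    @Override
    public void commit() { committed.putAll(pending); pending.clear(); }

    String read(String key) { return committed.get(key); }
}

// Toy queue: messages stay invisible to consumers until commit() is called.
class ToyQueue implements ToyResource {
    private final Queue<String> visible = new ArrayDeque<>();
    private final Queue<String> pending = new ArrayDeque<>();

    void send(String key) { pending.add(key); }

    @Override
    public void commit() { visible.addAll(pending); pending.clear(); }

    String receive() { return visible.poll(); }
}

class CommitOrderDemo {
    // Does methodA's work, then commits the resources in the chosen order,
    // giving methodB a chance to run after each individual commit call.
    static String run(boolean databaseFirst) {
        ToyDatabase db = new ToyDatabase();
        ToyQueue queue = new ToyQueue();
        db.update("order-1", "paid");
        queue.send("order-1");

        ToyResource[] order = databaseFirst
                ? new ToyResource[] {db, queue}
                : new ToyResource[] {queue, db};
        for (ToyResource resource : order) {
            resource.commit();
            String key = queue.receive(); // methodB wakes as soon as a message is visible
            if (key != null) {
                return db.read(key);      // null if the queue committed before the database
            }
        }
        return "no message delivered";
    }

    public static void main(String[] args) {
        System.out.println("database first: " + run(true));
        System.out.println("queue first:    " + run(false));
    }
}
```

With the database committed first the consumer reads the value; with the queue committed first the same consumer code reads null, which is the failure described above.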

So, having been let down by shortcomings in the specifications, what can the intrepid business logic programmer do to save the day?

Polling: Throw away the messaging piece and construct a loop that will look for new data in the db periodically. Yuck.

Messaging from afterCompletion: The transaction lifecycle provides an afterCompletion notification that is guaranteed to be ordered after all the XAResource commits complete, so we can send the message from there to avoid the race. Also icky, since it is then no longer atomic with the db write in the case of failures, which means we need polling as a backup anyhow.
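
A rough sketch of the afterCompletion idea follows. To keep it self-contained it does not use the real JTA classes: TxSynchronization is a simplified local stand-in loosely modelled on javax.transaction.Synchronization (the real callback receives an int status rather than a boolean), and ToyTransaction is a hypothetical toy that simply runs its callbacks after every commit has finished.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified local stand-in, loosely modelled on javax.transaction.Synchronization.
interface TxSynchronization {
    void beforeCompletion();
    void afterCompletion(boolean committed);
}

// Hypothetical toy transaction: callbacks run only after all commits are done.
class ToyTransaction {
    private final List<Runnable> commits = new ArrayList<>();
    private final List<TxSynchronization> synchronizations = new ArrayList<>();

    void enlist(Runnable commitAction) { commits.add(commitAction); }

    void registerSynchronization(TxSynchronization sync) { synchronizations.add(sync); }

    void commit() {
        for (TxSynchronization sync : synchronizations) sync.beforeCompletion();
        for (Runnable commitAction : commits) commitAction.run();
        for (TxSynchronization sync : synchronizations) sync.afterCompletion(true);
    }
}

class AfterCompletionDemo {
    static List<String> run() {
        List<String> events = new ArrayList<>();
        ToyTransaction tx = new ToyTransaction();
        tx.enlist(() -> events.add("database commit"));
        tx.registerSynchronization(new TxSynchronization() {
            @Override
            public void beforeCompletion() { }

            @Override
            public void afterCompletion(boolean committed) {
                // This send happens outside the transaction: if it fails here,
                // the database write is already durable and nothing rolls it back.
                if (committed) events.add("send message");
            }
        });
        tx.commit();
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The comment in afterCompletion marks the weak spot: the send is ordered after the database commit, but it is no longer part of the same atomic unit.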

Let it fail: An uncaught exception in the second tx, e.g. an NPE from the database read, should cause the container to fail the transaction and the messaging system to do message redelivery. Hopefully by that time the db is ready. Inefficient and prone to writing a lot of failure messages into your system logs, which is to say: yuck.

Force resource ordering: Many transaction systems offer some non-spec compliant mechanism for XAResource ordering. Due to the levels of abstraction the container places between the transaction manager and the programmer, this is not always easy to use from Java EE. It also ties your code to a specific vendor's container. You can do this with JBoss if you really want to, but it's not recommended.

Delay the message delivery: Similar to the above, some messaging systems go above and beyond the JMS spec to offer the ability to mark a queue or specific message as having a built-in delay. The messaging system will hold the message for a specified interval before making it available to consumers. Again you can do this with JBoss (see 'Scheduled Messages' in HornetQ), but whilst it's easy to code it's not a runtime-efficient solution, as tuning the delay interval is tricky.

Backoff/Retry in the consumer: Instead of letting the consuming transaction fail and be retried by message redelivery, catch the exception and handle it with a backoff/retry loop:

methodB {
    beginTransaction();
    key = receiveMessage(queueName);

    do {
        values = readDatabase(key);
        if(values == null) {
            sleep(x);
        }
    } while(values == null);

    doMoreStuff(values);
    commitTransaction();
}

This has the benefit of being portable across containers and, with a bit of added complexity, pretty robust. Yes, it is a horrible kludge, but it works. Sometimes, when you are stuck working with an environment built on flawed specifications, that's really the best you can hope for.
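
One possible shape for the "added complexity" is a bounded loop with an increasing delay, sketched below. The BackoffRetry class, the readWithBackoff name and its parameters are illustrative assumptions rather than part of any container API, and the Supplier stands in for readDatabase(key).

```java
import java.util.Optional;
import java.util.function.Supplier;

class BackoffRetry {
    // Calls read until it returns non-null or maxAttempts is used up,
    // doubling the sleep between attempts up to maxDelayMillis.
    static <T> Optional<T> readWithBackoff(Supplier<T> read, int maxAttempts,
                                           long initialDelayMillis, long maxDelayMillis) {
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T value = read.get();
            if (value != null) {
                return Optional.of(value);
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and give up
                    return Optional.empty();
                }
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
        // Still nothing: the caller can fail the transaction and fall back on redelivery.
        return Optional.empty();
    }

    public static void main(String[] args) {
        int[] reads = {0};
        // Simulated readDatabase: the row only becomes visible on the third read.
        Optional<String> values = readWithBackoff(
                () -> ++reads[0] < 3 ? null : "row-data", 5, 10, 100);
        System.out.println(values.orElse("gave up") + " after " + reads[0] + " reads");
    }
}
```

Capping the number of attempts keeps a genuinely missing row from holding the consuming transaction open forever; when the result is empty, throwing and letting the message be redelivered is one reasonable fallback.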

2 comments:

Luciano said...

I know this is an old post, but I don't see why the read database should return null. Transaction A updates the database, so there is a write lock on that row. Transaction A commits so it sends the message. Transaction B was waiting for the message to arrive, and then it reads the database. It will wait until the lock is released, and then will read the new value.
