Thursday, October 20, 2011

pen or pencil for writing logs?

Those old enough to remember the days of manned space flight in the western world will recall that NASA expended considerable resources to come up with a pen that would work in space. Meanwhile, half a world away, the Russians just used a bunch of pencils.

I can't help feeling that story is going to have a new counterpart in the history books. Whilst HP are working on memristor-based persistent RAM, someone else has just grafted a bunch of flash and a super-capacitor onto a regular DIMM instead. Now I just need the Linux guys to come up with a nice API...

New RAM shunts data into flash in power cuts

Tuesday, October 18, 2011

nested transactions 101

You wait ages for an interesting technical problem, then you get the same one twice in as many weeks. If you are a programmer, you now write a script so that you don't have to do any manual work the third time you encounter the problem. If you are a good programmer, you just use the script you wrote the previous week.

When applied to technical support questions, this approach results in the incremental creation of a FAQ. My last such update was way back in April, on the topic of handling XA commit race conditions between db updates and JMS messages. Like I said, you wait ages for an interesting problem, but another has finally come along. So, this FAQ update is all about nested transactions.

Mark has posted about nested transactions before, back in 2010 and 2009. They have of course been around even longer than that, and JBossTS/ArjunaTS has support for them that actually pre-dates Java - it was ported over from C++. So you'd think people would have gotten the hang of them by now, but not so. Nested transactions are still barely understood and even less used.

Let's deal with the understanding bit first. Many people use the term 'nested transactions' to mean different things. A true nested transaction is used mainly for fault isolation of specific tasks within a wider transaction.

tm.begin(); // outer tx
doStuffWithOuterTransaction();
tm.begin(); // inner tx
try {
    doStuffWithInnerTransaction();
    tm.commit(); // inner tx
} catch (Exception e) {
    // the failed inner tx is assumed rolled back by this point;
    // the outer tx is still active and carries on regardless
    handleFailureOfInnerTransaction();
}
doMoreStuffWithOuterTransaction();
tm.commit(); // outer tx

This construct is useful where we have some alternative way to achieve the work done by the inner transaction and can call it from the exception handler. Let's try a concrete example:

tm.begin(); // outer tx
bookTheatreTickets();
tm.begin(); // inner tx
try {
    bookWithFavoriteTaxiCompany();
    tm.commit(); // inner tx
} catch (Exception e) {
    // the failed inner tx is assumed rolled back by now - try plan B in a fresh one
    tm.begin(); // inner tx, take two
    bookWithRivalTaxiFirm();
    tm.commit(); // inner tx
}
bookRestaurantTable();
tm.commit(); // outer tx

So, when everything goes smoothly you have behaviour equivalent to a normal flat transaction. But when there is minor trouble in a non-essential part of the process, you can shrug it off and make forward progress without having to start over and risk losing your precious theatre seats.

As it turns out, there are a number of reasons this is not widely used.

Firstly, it's not all that common to have a viable alternative method available for the inner update in system-level transactions. Alternatives are more common in business-process style long-running transactions, where ACID is frequently less attractive than an extended tx model such as WS-BA anyhow. What about the case where you have no alternative method, don't care if the inner tx fails, but must not commit its work unless the outer transaction succeeds? That's what afterCompletion() is for.
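
For example, using the standard JTA Synchronization callback - a minimal sketch, in which sendCourtesyEmail stands in for whatever non-critical work you'd otherwise have put in the inner tx:

tm.getTransaction().registerSynchronization(new Synchronization() {
    public void beforeCompletion() {
        // nothing to do before the outcome is decided
    }

    public void afterCompletion(int status) {
        if (status == Status.STATUS_COMMITTED) {
            try {
                sendCourtesyEmail(); // runs only once the outer tx has committed
            } catch (Exception e) {
                // best effort - a failure here can't affect the already committed tx
            }
        }
    }
});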

Secondly, and often of greater practical importance, nested transactions are not supported by any of the widely deployed databases, message queuing products or other resource managers. That severely limits what you can do in the inner transaction. You're basically limited to using the TxOJ resource manager bundled with JBossTS, as described in Mark's posts. Give up any thought of updating your database conditionally - it just won't work. JDBC savepoints provide somewhat nested-transaction-like behaviour for non-XA situations.
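
A minimal sketch of the savepoint idiom, assuming a plain non-XA JDBC connection with auto-commit disabled - the booking methods here are illustrative stand-ins:

Savepoint sp = con.setSavepoint(); // mark the start of the 'inner' work
try {
    bookTaxi(con); // the inner transaction equivalent
} catch (SQLException e) {
    con.rollback(sp); // undo only the work done since the savepoint
}
bookRestaurant(con);
con.commit(); // one flat commit for everything that survived

Savepoints don't work in XA situations though. Nor does the XA protocol, the foundation of interoperability between transaction managers and resource managers, provide any alternative. That said, it's theoretically possible to fudge things a bit. Let's look at that example again in XA terms.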

tm.begin(); // outer tx
bookTheatreTickets(); // enlist db-A.
tm.begin(); // inner tx
try {
    bookWithFavoriteTaxiCompany(); // enlist db-B.
    tm.commit(); // inner tx - prepare db-B. Don't commit it though. Don't touch db-A.
} catch (Exception e) {
    // oh dear, the prepare on db-B failed. roll it back. Don't rollback db-A though.
    tm.begin(); // inner tx
    bookWithRivalTaxiFirm(); // enlist db-C
    tm.commit(); // inner tx - prepare db-C but don't commit it or touch db-A
}
bookRestaurantTable(); // enlist db-D
tm.commit(); // outer tx - prepare db-A and db-D. Commit db-A, db-C and db-D.

This essentially fakes a nested transaction by manipulating the list of resource managers in a single flat transaction - we cheated a bit by removing db-B part way through, so the tx is not truly ACID across all four participants, only three. JBossTS does not support this, because it's written by purists who think you should use an extended transaction model instead. Also, we don't want to deal with irate users whose database throughput has plummeted because of the length of time that locks are being held on db-B and db-C.

Fortunately, you may not actually need true nested transactions anyhow. There is another sort of nested transaction, properly known as nested top-level, which not only works with pretty much any environment, but is also handy for many common use cases.

The distinction is founded on the asymmetry of the relationship between the outer and inner transactions. For true nested transactions, failure of the inner tx need not impact the outcome of the outer tx, whilst failure of the outer tx will ensure the inner tx rolls back. For nested top-level, the situation is reversed: failure of the outer transaction won't undo the inner tx, but failure of the inner tx may prevent the outer one from committing. Sound familiar? The most widely deployed use case for nested top-level is ensuring that an audit log entry of the processing attempt is made, regardless of the outcome of the business activity.

tm.begin();
doUnauditedStuff();
writeAuditLogForProcessingAttempt();
doSecureBusinessSystemUpdate();
tm.commit();

The ACID properties of the flat tx don't achieve what we want here - the audit log entry must be created regardless of the success or failure of the business system update, whereas above it commits only if the business system update also commits. Let's try that again:

tm.begin(); // tx-A
doUnauditedStuff();
Transaction txA = tm.suspend();
tm.begin(); // new top-level tx-B
try {
    writeAuditLogForProcessingAttempt();
    tm.commit(); // tx-B
} catch (Exception e) {
    if (tm.getStatus() == Status.STATUS_ACTIVE) {
        tm.rollback(); // the write rather than the commit failed, so tx-B is still live - discard it
    }
    tm.resume(txA);
    tm.rollback(); // tx-A
    return;
}
tm.resume(txA);
doSecureBusinessSystemUpdate();
tm.commit(); // tx-A

Well, that's a little better - we'll not attempt the business logic processing unless we have first successfully written the audit log, so we're guaranteed to always have a log of any update that does take place. But there is a snag: the audit log will only show the attempt, not the success/failure outcome of it. What if that's not good enough? Let's steal a leaf from the transaction optimization handbook: presumed abort.

tm.begin(); // tx-A
doUnauditedStuff();
Transaction txA = tm.suspend();
tm.begin(); // new top-level tx-B
try {
    writeAuditLogForProcessingAttempt("attempting update, assume it failed");
    tm.commit(); // tx-B
} catch (Exception e) {
    if (tm.getStatus() == Status.STATUS_ACTIVE) {
        tm.rollback(); // discard the still-active tx-B before switching back
    }
    tm.resume(txA);
    tm.rollback(); // tx-A
    return;
}
tm.resume(txA);
doSecureBusinessSystemUpdate();
writeAuditLogForProcessingAttempt("processing attempt completed successfully");
tm.commit(); // tx-A

So now we have an audit log that will always show an entry and always show whether the update succeeded or not. Also, I'll hopefully never have to answer another nested transaction question from scratch. Success all round, I'd say.

memristor based logs

Long time readers will recall that I've been tinkering with shiny toys in the form of SSDs, trying to assess how changes in storage technology cause changes in the way transaction logging should be designed. SSDs are here now, getting cheaper all the time and therefore becoming more 'meh' by the minute. So, I need something even newer and shinier to drool over...

Enter memristors, arguably the coolest tech to emerge from HP since the last Total-e-Server release. Initially intended to compete with flash, memristor technology also has the longer term potential to give us persistent RAM. A server with a power-loss-tolerant storage mechanism that runs at approximately main memory speed will fundamentally change the way we think about storage hierarchies, process state and fault tolerance.

Until now the on-chip cache hierarchy and off-chip RAM have both come under the heading of 'volatile', whilst disk has been considered persistent, give or take a bit of RAID controller caching.

Volatile storage is managed by the memory subsystem, with the cache hierarchy within that tier largely transparent to the O/S, let alone the apps. Yes, performance gurus will take issue with that - understanding cache coherency models is vital to getting the best out of multi-core chips and multi-socket servers. But by and large we don't control it directly - MESI is hardcoded in the CPU and we only influence it with simple primitives: memory fencing, thread-to-core pinning and such.

Persistent storage meanwhile is managed by the file system stack - the O/S block cache, disk drivers, RAID controllers, on-device and on-controller caches etc. As more of it is in software, we have a little more control over the cache model, via O_DIRECT, fsync, firmware config tweaking and such. Most critically, we can divide the persistent storage into different pools with different properties. The best known example is the age-old configuration suggestion for high performance transaction processing: put the log storage on a dedicated device.
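
To make that concrete, forcing a transaction log record onto stable storage from Java today looks roughly like this - a sketch, with the path and record bytes being illustrative:

FileChannel log = new FileOutputStream("/var/data/tx/log.bin", true).getChannel();
log.write(ByteBuffer.wrap(logRecord)); // append the record - it may still be sitting in O/S caches
log.force(true); // don't return until the data (and metadata) have reached the device

That force() is the expensive bit, which is precisely why the log earns its own dedicated device.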

So what will the programming model look like when we have hardware that offers persistent RAM, either for all the main memory or, more likely in the medium term, for some subset of it? Will the entire process state survive a power loss at all cache tiers from the on-CPU registers to the disk platters, or will we need fine grained cache control to say 'synchronously flush from volatile RAM to persistent RAM', much as we currently force a sync to disk? How will we resume execution after a power loss? Will we need to explicitly reattach to our persistent RAM and rebuild our transient data structures from its contents, or will the O/S magically handle it all for us? Do we explicitly serialize data moving between volatile and non-volatile RAM, as we currently do with RAM to disk transfers, or is it automatic as with cache to RAM movements? What does this mean for us in terms of new language constructs, libraries and design patterns?
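
If I had to guess at the shape of an explicit-control answer, it might look vaguely like this. To be clear, this is an entirely hypothetical API, sketched only to make the questions concrete - none of these types exist:

PersistentRegion log = PersistentMemory.map("tx-log", 64 * 1024 * 1024); // hypothetical: map a named region of persistent RAM
log.put(offset, logRecord); // lands in volatile CPU cache and RAM first...
log.flush(offset, logRecord.length); // ...then the fsync analogue pushes it to the persistent cells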

Many, many interesting questions, the answers to which will dictate the programming landscape for a new generation of middleware and the applications that use it. The shift from HDD to SSD may seem minor in comparison. Will everything we know about the arcane niche of logging and crash recovery become obsolete, or become even more fundamental and mainstream? Job prospects aside, on this occasion I'm leaning rather in favour of obsolete.