Narayana team blog: July 2011

Friday, July 22, 2011

norecoveryxa

"Infamy! Infamy! They've all got it in for me!"

Of all the error and warning messages in JBossTS, none is more infamous than norecoveryxa. Pretty much every JBossTS user has an error log containing this little gem. Usually more than once. A lot more. But not for much longer. It has incurred my wrath and its days are numbered. I've got it in for that annoying little beast.


[com.arjuna.ats.internal.jta.resources.arjunacore.norecoveryxa]
Could not find new XAResource to use for recovering non-serializable XAResource

Transaction logs preserve information needed to complete interrupted transactions after a crash. Boiled down to pure essence, a transaction log is a list of the Xids involved in a transaction. An Xid is simply a bunch of bytes, beginning with a unique tx id and ending with a branch id that is (usually) different for each resource manager in the transaction. During recovery, these Xids are used to tell the resource managers to commit the in-doubt transactions. All of which is fine in theory, but an utter pain in practice.
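To make the shape of an Xid concrete, here is a minimal sketch in Java. The class `SimpleXid` and its methods are hypothetical names used only for illustration. A real transaction manager implements the standard `javax.transaction.xa.Xid` interface, which exposes the same three parts: a format id, a global transaction id and a branch qualifier.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for javax.transaction.xa.Xid.
final class SimpleXid {
    // The XA interface caps each of the two byte arrays at 64 bytes.
    static final int MAX_PART_SIZE = 64;

    private final int formatId;
    private final byte[] globalTransactionId; // shared by every branch of one transaction
    private final byte[] branchQualifier;     // (usually) different for each resource manager

    SimpleXid(int formatId, byte[] globalTransactionId, byte[] branchQualifier) {
        if (globalTransactionId.length > MAX_PART_SIZE || branchQualifier.length > MAX_PART_SIZE) {
            throw new IllegalArgumentException("Xid parts are limited to " + MAX_PART_SIZE + " bytes");
        }
        this.formatId = formatId;
        this.globalTransactionId = globalTransactionId.clone();
        this.branchQualifier = branchQualifier.clone();
    }

    // True when both Xids belong to the same global transaction.
    boolean sameTransaction(SimpleXid other) {
        return formatId == other.formatId
                && Arrays.equals(globalTransactionId, other.globalTransactionId);
    }

    // True when both Xids name the same branch of the same transaction.
    boolean sameBranch(SimpleXid other) {
        return sameTransaction(other)
                && Arrays.equals(branchQualifier, other.branchQualifier);
    }
}
```

Two resource managers enlisted in one transaction would hold Xids for which `sameTransaction` is true and `sameBranch` is false.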

The core problem is that an Xid is not a lot of use unless you can actually reconnect to the resource manager and get an XAResource object from its driver, as it is the methods on that object to which you pass the Xid. So you need some way of storing connection information also. In rare cases the RM's driver provides a Serializable XAResource implementation, in which case the simplest (although not necessarily fastest) thing to do is serialize it into the tx log file. Recovery is then easy - deserialize the XAResource and invoke commit on it. Well, except for the whole multiple classloaders thing, but that's another story. Besides, hardly any drivers are actually so accommodating. So, we need another approach.

To deal with non-serializable XAResources, the recovery system allows registration of plugins, one per resource manager, that provide a new XAResource instance on demand. In olden times configuring these was a manual task, frequently overlooked. As a result the recovery system found itself unable to get a new XAResource instance to handle the Xid from the transaction logs. Hence norecoveryxa. That problem was solved, at least for xa-datasource use within JBossAS, by having the datasource deployment handler automatically register a plugin with the recovery system. Bye-bye AppServerJDBCXARecovery, you will not be missed.
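The plugin idea can be sketched roughly as follows. This is an illustration only, not the JBossTS API: `RecoveryPlugins`, `Resource` and `findResourceFor` are hypothetical names, Xids are reduced to strings, and `Resource` stands in for the small part of `XAResource` that recovery needs.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

// Hypothetical sketch of a recovery plugin registry.
final class RecoveryPlugins {
    // Stand-in for an XAResource: all we ask is which Xids it holds in doubt.
    interface Resource {
        Set<String> inDoubtXids();
    }

    // One plugin per resource manager, each able to create a fresh Resource on demand.
    private final List<Supplier<Resource>> plugins = new CopyOnWriteArrayList<>();

    void register(Supplier<Resource> plugin) {
        plugins.add(plugin);
    }

    // Asks each registered plugin for a new Resource and returns the first one
    // that recognises the Xid. An empty result corresponds to norecoveryxa.
    Optional<Resource> findResourceFor(String xid) {
        for (Supplier<Resource> plugin : plugins) {
            Resource candidate = plugin.get();
            if (candidate.inDoubtXids().contains(xid)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

With no plugin registered for the owning resource manager, `findResourceFor` comes back empty however many times recovery retries, which is exactly the overlooked configuration case described above.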

That leaves cases where the resource manager is something other than an XA aware JDBC datasource. To help diagnose such cases it would be helpful to have something more than a hex representation of the Xid. Whilst the standard XAResource API does not allow for such metadata, we now have an extension that provides the transaction manager with RM type, version and JNDI binding information. This gets written to the tx logs and, in the case of the JNDI name, encoded into the Xid.

As a side benefit, this information allows for much more helpful debug messages and hopefully means I spend less time on support cases. Didn't really think I was doing all this just for your benefit did you?

before:


TRACE: XAResourceRecord.topLevelPrepare for < formatId=131076,
gtrid_length=29, bqual_length=28, tx_uid=0:ffff7f000001:ccd6:4e0da81f:2,
node_name=1, branch_uid=0:ffff7f000001:ccd6:4e0da81f:3>

after:

TRACE: XAResourceRecord.topLevelPrepare for < formatId=131076,
gtrid_length=29, bqual_length=28, tx_uid=0:ffff7f000001:ccd6:4e0da81f:2,
node_name=1, branch_uid=0:ffff7f000001:ccd6:4e0da81f:3>,
product: OracleDB/11g, jndiName: java:/jdbc/TestDB >

The main reason we want the RM metadata though is so that when recovery tries and fails to find an XAResource instance to match the Xid, it can now report the details of the RM to which that Xid belongs. This makes it more obvious which RM is unavailable, e.g. because it's not registered for recovery or because it is temporarily down.


WARN: ARJUNA16037: Could not find new XAResource to use for recovering
non-serializable XAResource XAResourceRecord < resource:null,
txid:< formatId=131076, gtrid_length=29, bqual_length=41,
tx_uid=0:ffff7f000001:e437:4e294bb6:6, node_name=1,
branch_uid=0:ffff7f000001:e437:4e294bb6:7, eis_name=java:/jdbc/DummyDB >,
heuristic: TwoPhaseOutcome.FINISH_OK,
product: DummyProductName/DummyProductVersion,
jndiName: java:/jdbc/DummyDB>

Despite the name change (part of the new i18n framework), that's still our old nemesis norecoveryxa. It's made of stern stuff and, although now a little more useful, is not so easily vanquished entirely.

Even with all the resource managers correctly registered, a recovery scan can still fail to find the Xid. That's because of a timing window in the XA protocol, where a crash occurs after the RM has committed and forgotten the branch but before the TM has disposed of its no longer needed log file. In such cases no RM will claim ownership of the branch during replay, leaving the TM to assume that the owning RM is unavailable for some reason. With the name of the owning RM to hand, we can now match it against the names of the scanned RMs, conclude the branch belongs to one of the scanned RMs even though it pleaded ignorance, and hence dispose of it as a safely committed branch.
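The decision logic described here might look something like the sketch below. It is a simplification under stated assumptions, not the JBossTS implementation: `BranchResolver`, `Decision` and `resolve` are hypothetical names, Xids are plain strings, and the scan results are assumed to be a map from each RM name that answered the recovery scan to the set of Xids it reported as in doubt.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of resolving a logged branch using the owning RM's name.
final class BranchResolver {
    enum Decision {
        COMMIT_VIA_RM,     // the owning RM still holds the branch: tell it to commit
        ASSUME_COMMITTED,  // the owning RM was scanned but no longer knows the branch
        RETRY_LATER        // the owning RM was not scanned: unavailable or unregistered
    }

    static Decision resolve(String xid, String owningRmName,
                            Map<String, Set<String>> scanResults) {
        Set<String> reportedByOwner = scanResults.get(owningRmName);
        if (reportedByOwner == null) {
            // Nobody with that name answered, so we cannot tell what happened.
            return Decision.RETRY_LATER;
        }
        // The owner answered. If it does not list the Xid, it must already have
        // committed and forgotten the branch, so the log entry can be disposed of.
        return reportedByOwner.contains(xid)
                ? Decision.COMMIT_VIA_RM
                : Decision.ASSUME_COMMITTED;
    }
}
```

Without the RM name in the log, the second and third outcomes are indistinguishable, which is why the warning used to be unavoidable in this window.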

All of which should ensure that norecoveryxa's days in the spotlight are severely numbered.

Friday, July 8, 2011

rethinking log storage

Transaction coordinators like JBossTS must write log files to fault tolerant storage in order to be able to guarantee correct completion of transactions in the event of a crash. Historically this has meant using files hosted on RAIDed hard disks.

The workload is characterised by near 100% writes, reads for recovery only, a large number of small (sub 1 KB) operations and, critically, the need to guarantee that the data has been forced through assorted cache layers to non-volatile media before continuing.

Unfortunately this combination of requirements has a tendency to negate many of the performance optimizations provided in storage systems, making log writing a common bottleneck for high performance transaction systems. Addressing this can often force users into unwelcome architectural decisions, particularly in cloud environments.

So naturally transaction system developers like the JBossTS team spend a lot of time thinking about how to manage log I/O better. Those who have read or watched my JUDCon presentation will already have some hints of the ideas we're currently looking at, including using cluster based in-memory replication instead of disk storage.

More recently I've been playing with SSD based storage some more, following up on earlier work to better understand some of the issues that arise as users transition to a new generation of disk technology with new performance characteristics.

As you'll recall, we recently took the high speed journalling code from HornetQ and used it as the basis for a transaction log. Relatively speaking, the results were a pretty spectacular improvement over our classic logging code. In absolute terms however we still weren't getting the best out of the hardware.

A few weeks back I got my grubby paws on one of the latest SSDs. On paper its next generation controller and faster interface should have provided a substantial improvement over its predecessor. Not so for my transaction logging tests though - in keeping with tradition it utterly failed to outperform the older generation of technology. On further investigation the reasons for this became clear.

Traditional journalling solutions are based on a) aggregating a large number of small I/Os into a much smaller number of larger I/Os so that the drive can keep up with the load and b) serializing these writes to a single append-only file in order to avoid expensive head seeks.

With SSDs the first of those ideas is still sound, but the number of I/O events the drive can deal with is substantially higher. This requires re-tuning the journal parameters. For some usage it even becomes undesirable to batch the I/Os - until the drive is saturated with requests it's just pointless overhead that delays threads unnecessarily. As those threads (transactions) may have locks on other data any additional latency is undesirable. Also, unlike some other systems, a transaction manager has one thread per tx, as it's the one executing the business logic. It can't proceed until the log write completes, so write batching solutions involve significant thread management and scheduling overhead and often have a large number of threads parked waiting on I/O.

There is another snag though: even where the journalling code can use async I/O to dispatch multiple writes to the O/S in parallel, the filesystem still runs them sequentially as they contend on the inode semaphore for the log file. Thus writes for unrelated transactions must wait unnecessarily, much like the situation that arises in business applications that use too coarse-grained locking. Also, the drive's nice NCQ (native command queuing) remains largely unused, limiting the ability of the drive controller to execute writes in parallel.

The serialization of writes to the hardware, whilst useful for head positioning on HDDs, is a painful mistake for SSDs. These devices, like a modern CPU, require high concurrency in the workload to perform at their best. So, just as we go looking for parallelization opportunities in our apps, so we must look for them in the design of the I/O logging.

The most obvious solution is to load balance the logging over several journal files when running on an SSD. It's not quite that simple though - care must be taken to avoid a filesystem journal write as that will trigger a buffer flush for the entire filesystem, not just the individual file. Not to mention contention on the filesystem journal. For optimal performance it may pay to put each log file on its own small filesystem. But I'm getting ahead of myself. First there is the minor matter of actually writing a prototype load balancer to test the ideas. Any volunteers?
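As a starting point for discussion, a load balancer of that kind could be sketched as below. Everything here is an assumption made for illustration: the class name `StripedJournal`, the `journal-N.log` file naming and the round-robin policy are hypothetical, and none of it is taken from the JBossTS or HornetQ code. The sketch also does nothing to avoid filesystem journal writes (that would need at least preallocated files) and omits the record framing a real log needs for recovery.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: spread forced log writes over several append-only files.
final class StripedJournal implements AutoCloseable {
    private final FileChannel[] stripes;
    private final AtomicInteger next = new AtomicInteger();

    StripedJournal(Path directory, int stripeCount) throws IOException {
        if (stripeCount < 1) {
            throw new IllegalArgumentException("need at least one stripe");
        }
        stripes = new FileChannel[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = FileChannel.open(directory.resolve("journal-" + i + ".log"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }
    }

    // Appends one record to the next stripe in round-robin order and blocks
    // until the data has been forced to the device. Returns the stripe index used.
    int append(byte[] record) throws IOException {
        int index = Math.floorMod(next.getAndIncrement(), stripes.length);
        FileChannel stripe = stripes[index];
        // Writers contend only with others on the same stripe, not on one global file.
        synchronized (stripe) {
            ByteBuffer buffer = ByteBuffer.wrap(record);
            while (buffer.hasRemaining()) {
                stripe.write(buffer);
            }
            stripe.force(false); // false: do not insist on flushing file metadata
        }
        return index;
    }

    @Override
    public void close() throws IOException {
        for (FileChannel stripe : stripes) {
            stripe.close();
        }
    }
}
```

Whether this actually beats a single journal on a given SSD and filesystem is precisely what the prototype would have to measure. The stripe count and the placement of each file on its own filesystem are the obvious knobs to experiment with.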