Friday, July 22, 2011

norecoveryxa

"Infamy! Infamy! They've all got it in for me!"

Of all the error and warning messages in JBossTS, none is more infamous than norecoveryxa. Pretty much every JBossTS user has an error log containing this little gem. Usually more than once. A lot more. But not for much longer. It has incurred my wrath and its days are numbered. I've got it in for that annoying little beast.


[com.arjuna.ats.internal.jta.resources.arjunacore.norecoveryxa]
Could not find new XAResource to use for recovering non-serializable XAResource


Transaction logs preserve information needed to complete interrupted transactions after a crash. Boiled down to pure essence, a transaction log is a list of the Xids involved in a transaction. An Xid is simply a bunch of bytes, beginning with a unique tx id and ending with a branch id that is (usually) different for each resource manager in the transaction. During recovery, these Xids are used to tell the resource managers to commit the in-doubt transactions. All of which is fine in theory, but an utter pain in practice.

The core problem is that an Xid is not a lot of use unless you can actually reconnect to the resource manager and get an XAResource object from its driver, as it is the methods on that object to which you pass the Xid. So you need some way of storing connection information also. In rare cases the RM's driver provides a Serializable XAResource implementation, in which case the simplest (although not necessarily fastest) thing to do is serialize it into the tx log file. Recovery is then easy - deserialize the XAResource and invoke commit on it. Well, except for the whole multiple classloaders thing, but that's another story. Besides, hardly any drivers are actually so accommodating. So, we need another approach.

To deal with non-serializable XAResources, the recovery system allows registration of plugins, one per resource manager, that provide a new XAResource instance on demand. In olden times configuring these was a manual task, frequently overlooked. As a result the recovery system found itself unable to get a new XAResource instance to handle the Xid from the transaction logs. Hence norecoveryxa. That problem was solved, at least for xa-datasource use within JBossAS, by having the datasource deployment handler automatically register a plugin with the recovery system. Bye-bye AppServerJDBCXARecovery, you will not be missed.

That leaves cases where the resource manager is something other than an XA aware JDBC datasource. To help diagnose such cases it would be helpful to have something more than a hex representation of the Xid. Whilst the standard XAResource API does not allow for such metadata, we now have an extension that provides the transaction manager with RM type, version and JNDI binding information. This gets written to the tx logs and, in the case of the JNDI name, encoded into the Xid.

As a side benefit, this information allows for much more helpful debug messages and hopefully means I spend less time on support cases. Didn't really think I was doing all this just for your benefit did you?

before:

TRACE: XAResourceRecord.topLevelPrepare for < formatId=131076,
gtrid_length=29, bqual_length=28, tx_uid=0:ffff7f000001:ccd6:4e0da81f:2,
node_name=1, branch_uid=0:ffff7f000001:ccd6:4e0da81f:3>


after:
TRACE: XAResourceRecord.topLevelPrepare for < formatId=131076,
gtrid_length=29, bqual_length=28, tx_uid=0:ffff7f000001:ccd6:4e0da81f:2,
node_name=1, branch_uid=0:ffff7f000001:ccd6:4e0da81f:3>,
product: OracleDB/11g, jndiName: java:/jdbc/TestDB >


The main reason we want the RM metadata though is so that when recovery tries and fails to find an XAResource instance to match the Xid, it can now report the details of the RM to which that Xid belongs. This makes it more obvious which RM is unavailable e.g. because it's not registered for recovery or it is temporarily down.


WARN: ARJUNA16037: Could not find new XAResource to use for recovering
non-serializable XAResource XAResourceRecord < resource:null,
txid:< formatId=131076, gtrid_length=29, bqual_length=41,
tx_uid=0:ffff7f000001:e437:4e294bb6:6, node_name=1,
branch_uid=0:ffff7f000001:e437:4e294bb6:7, eis_name=java:/jdbc/DummyDB >,
heuristic: TwoPhaseOutcome.FINISH_OK,
product: DummyProductName/DummyProductVersion,
jndiName: java:/jdbc/DummyDB>


Despite the name change (part of the new i18n framework), that's still our old nemesis norecoveryxa. It's made of stern stuff and, although now a little more useful, is not so easily vanquished entirely.

Even with all the resource managers correctly registered, it's still possible to be unable to find the Xid during a recovery scan. That's because of a timing window in the XA protocol, where a crash occurs after the RM has committed and forgotten the branch but before the TM has disposed of its no longer needed log file. In such cases no RM will claim ownership of the branch during replay, leaving the TM to assume that the owning RM is unavailable for some reason. With the name of the owning RM to hand, we now have the possibility to match against the name of the scanned RMs, conclude the branch belongs to one of the scanned RMs even though it pleaded ignorance, and hence dispose of it as a safely committed branch.

All of which should ensure that norecoveryxa's days in the spotlight are severely numbered.

1 comment:

Duncan Doyle said...

Awesome work. Have spend a lot of time in the past analyzing in-doubt XA transactions and had to go through a lot of hassle figuring out which XAResource was giving problems :-)

Which version of JBossTS (and JBoss AS/JBoss EAP) contains this improvement?